[PATCH v6 3/3] dax: Wake up all waiters after invalidating dax entry

2021-04-28 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
    invalidate_exceptional_entry2()
      dax_invalidate_mapping_entry_sync()
        __dax_invalidate_entry() {
                xas_lock_irq(&xas);
                entry = get_unlocked_entry(&xas, 0);
                ...
                ...
                dax_disassociate_entry(entry, mapping, trunc);
                xas_store(&xas, NULL);
                ...
                ...
                put_unlocked_entry(&xas, entry);
                xas_unlock_irq(&xas);
        }

Say a fault is in progress and it has locked the entry at offset "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
the dax entry is locked, all three instances A, B, C will wait in the wait
queue.

When the dax fault finishes, say A is woken up. It will store a NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). Given
the current code, this means put_unlocked_entry() will not wake up the
next waiter, so C continues to wait and is never woken up.

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.
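
For reference, below is a condensed sketch of the tail of
__dax_invalidate_entry() with the fix applied, reconstructed from the call
chain above and the one-line diff below (surrounding code elided). The key
point is that the entry has already been erased from the tree by the time
put_unlocked_entry() runs, so a later waiter that finds a NULL entry will not
wake anybody else; only a wake-all here unblocks every remaining waiter.

	dax_disassociate_entry(entry, mapping, trunc);
	xas_store(&xas, NULL);          /* later waiters will now see NULL */
	mapping->nrexceptional--;
	ret = 1;
out:
	/* WAKE_NEXT would strand waiters that find the entry gone */
	put_unlocked_entry(&xas, entry, WAKE_ALL);
	xas_unlock_irq(&xas);
	return ret;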

Reported-by: Sergio Lopez 
Fixes: ac401cc78242 ("dax: New fault locking")
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 56eb1c759ca5..df5485b4bddf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	mapping->nrexceptional--;
 	ret = 1;
 out:
-	put_unlocked_entry(&xas, entry, WAKE_NEXT);
+	put_unlocked_entry(&xas, entry, WAKE_ALL);
 	xas_unlock_irq(&xas);
 	return ret;
 }
-- 
2.25.4


[PATCH v6 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-28 Thread Vivek Goyal
As of now put_unlocked_entry() always wakes up the next waiter. In a later
patch we want to wake up all waiters at one call site. Hence, add a
wake-mode parameter to the function.

This patch does not introduce any change of behavior.

Reviewed-by: Greg Kurz 
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5ecee51c44ee..56eb1c759ca5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct xa_state *xas, 
void *entry)
finish_wait(wq, );
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_wake_mode mode)
 {
-   /* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(struct 
address_space *mapping,
entry = get_unlocked_entry(, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
return ret;
 }
@@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, 
unsigned int order)
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);
-- 
2.25.4


[PATCH v6 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-28 Thread Vivek Goyal
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only the next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it explicit
at the call site itself, which makes the code easier to read.

This patch should not introduce any change of behavior.
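
To illustrate, here is one call site before and after this change (taken
from the diff below):

	/* before: boolean flag, intent not obvious at the call site */
	dax_wake_entry(xas, entry, true);

	/* after: explicit wake mode */
	dax_wake_entry(xas, entry, WAKE_ALL);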

Reviewed-by: Greg Kurz 
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fdc6775..5ecee51c44ee 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+/**
+ * enum dax_wake_mode: waitqueue wakeup behaviour
+ * @WAKE_ALL: wake all waiters in the waitqueue
+ * @WAKE_NEXT: wake only the first waiter in the waitqueue
+ */
+enum dax_wake_mode {
+   WAKE_ALL,
+   WAKE_NEXT,
+};
+
 static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
void *entry, struct exceptional_entry_key *key)
 {
@@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
*wait,
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+  enum dax_wake_mode mode)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
@@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
*entry, bool wake_all)
 * must be in the waitqueue and the following check will see them.
 */
if (waitqueue_active(wq))
-   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
+   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
 }
 
 /*
@@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
old = xas_store(xas, entry);
xas_unlock_irq(xas);
BUG_ON(!dax_is_locked(old));
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrexceptional--;
entry = NULL;
xas_set(xas, index);
@@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
xas_lock_irq(xas);
xas_store(xas, entry);
xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 
trace_dax_writeback_one(mapping->host, index, count);
return ret;
-- 
2.25.4


[PATCH v6 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-28 Thread Vivek Goyal
Hi,

This is V6. The only change since V5 is that I changed the order of WAKE_NEXT
and WAKE_ALL in the comments too.

Vivek

Vivek Goyal (3):
  dax: Add an enum for specifying dax wakeup mode
  dax: Add a wakeup mode parameter to put_unlocked_entry()
  dax: Wake up all waiters after invalidating dax entry

 fs/dax.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

-- 
2.25.4


Re: [PATCH v5 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-28 Thread Vivek Goyal
On Wed, Apr 28, 2021 at 12:50:38PM -0400, Vivek Goyal wrote:
> Dan mentioned that he is not very fond of passing around a boolean true/false
> to specify if only next waiter should be woken up or all waiters should be
> woken up. He instead prefers that we introduce an enum and make it very
> explicity at the callsite itself. Easier to read code.
> 
> This patch should not introduce any change of behavior.
> 
> Reviewed-by: Greg Kurz 
> Reviewed-by: Jan Kara 
> Suggested-by: Dan Williams 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/dax.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b3d27fdc6775..c8cd2ae4440b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
>   struct exceptional_entry_key key;
>  };
>  
> +/**
> + * enum dax_wake_mode: waitqueue wakeup behaviour
> + * @WAKE_NEXT: wake only the first waiter in the waitqueue
> + * @WAKE_ALL: wake all waiters in the waitqueue
> + */

I just noticed that I did not change order in comments. Will post
another version. Sorry about the noise.

Vivek

> +enum dax_wake_mode {
> + WAKE_ALL,
> + WAKE_NEXT,
> +};
> +
>  static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
>   void *entry, struct exceptional_entry_key *key)
>  {
> @@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
> *wait,
>   * The important information it's conveying is whether the entry at
>   * this index used to be a PMD entry.
>   */
> -static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
> +static void dax_wake_entry(struct xa_state *xas, void *entry,
> +enum dax_wake_mode mode)
>  {
>   struct exceptional_entry_key key;
>   wait_queue_head_t *wq;
> @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
> *entry, bool wake_all)
>* must be in the waitqueue and the following check will see them.
>*/
>   if (waitqueue_active(wq))
> - __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
> + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
>  }
>  
>  /*
> @@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
> *entry)
>  {
>   /* If we were the only waiter woken, wake the next one */
>   if (entry && !dax_is_conflict(entry))
> - dax_wake_entry(xas, entry, false);
> + dax_wake_entry(xas, entry, WAKE_NEXT);
>  }
>  
>  /*
> @@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
> *entry)
>   old = xas_store(xas, entry);
>   xas_unlock_irq(xas);
>   BUG_ON(!dax_is_locked(old));
> - dax_wake_entry(xas, entry, false);
> + dax_wake_entry(xas, entry, WAKE_NEXT);
>  }
>  
>  /*
> @@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  
>   dax_disassociate_entry(entry, mapping, false);
>   xas_store(xas, NULL);   /* undo the PMD join */
> - dax_wake_entry(xas, entry, true);
> + dax_wake_entry(xas, entry, WAKE_ALL);
>   mapping->nrexceptional--;
>   entry = NULL;
>   xas_set(xas, index);
> @@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
> dax_device *dax_dev,
>   xas_lock_irq(xas);
>   xas_store(xas, entry);
>   xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
> - dax_wake_entry(xas, entry, false);
> + dax_wake_entry(xas, entry, WAKE_NEXT);
>  
>   trace_dax_writeback_one(mapping->host, index, count);
>   return ret;
> -- 
> 2.25.4
> 


[PATCH v5 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-28 Thread Vivek Goyal
As of now put_unlocked_entry() always wakes up the next waiter. In a later
patch we want to wake up all waiters at one call site. Hence, add a
wake-mode parameter to the function.

This patch does not introduce any change of behavior.

Reviewed-by: Greg Kurz 
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c8cd2ae4440b..e84dd240c35c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct xa_state *xas, 
void *entry)
finish_wait(wq, );
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_wake_mode mode)
 {
-   /* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(struct 
address_space *mapping,
entry = get_unlocked_entry(, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
return ret;
 }
@@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, 
unsigned int order)
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);
-- 
2.25.4


[PATCH v5 3/3] dax: Wake up all waiters after invalidating dax entry

2021-04-28 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
    invalidate_exceptional_entry2()
      dax_invalidate_mapping_entry_sync()
        __dax_invalidate_entry() {
                xas_lock_irq(&xas);
                entry = get_unlocked_entry(&xas, 0);
                ...
                ...
                dax_disassociate_entry(entry, mapping, trunc);
                xas_store(&xas, NULL);
                ...
                ...
                put_unlocked_entry(&xas, entry);
                xas_unlock_irq(&xas);
        }

Say a fault is in progress and it has locked the entry at offset "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
the dax entry is locked, all three instances A, B, C will wait in the wait
queue.

When the dax fault finishes, say A is woken up. It will store a NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). Given
the current code, this means put_unlocked_entry() will not wake up the
next waiter, so C continues to wait and is never woken up.

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.

Reported-by: Sergio Lopez 
Fixes: ac401cc78242 ("dax: New fault locking")
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index e84dd240c35c..42d7406b2cea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	mapping->nrexceptional--;
 	ret = 1;
 out:
-	put_unlocked_entry(&xas, entry, WAKE_NEXT);
+	put_unlocked_entry(&xas, entry, WAKE_ALL);
 	xas_unlock_irq(&xas);
 	return ret;
 }
-- 
2.25.4


[PATCH v5 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-28 Thread Vivek Goyal
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only the next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it explicit
at the call site itself, which makes the code easier to read.

This patch should not introduce any change of behavior.

Reviewed-by: Greg Kurz 
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fdc6775..c8cd2ae4440b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+/**
+ * enum dax_wake_mode: waitqueue wakeup behaviour
+ * @WAKE_NEXT: wake only the first waiter in the waitqueue
+ * @WAKE_ALL: wake all waiters in the waitqueue
+ */
+enum dax_wake_mode {
+   WAKE_ALL,
+   WAKE_NEXT,
+};
+
 static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
void *entry, struct exceptional_entry_key *key)
 {
@@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
*wait,
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+  enum dax_wake_mode mode)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
@@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
*entry, bool wake_all)
 * must be in the waitqueue and the following check will see them.
 */
if (waitqueue_active(wq))
-   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
+   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
 }
 
 /*
@@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
old = xas_store(xas, entry);
xas_unlock_irq(xas);
BUG_ON(!dax_is_locked(old));
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrexceptional--;
entry = NULL;
xas_set(xas, index);
@@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
xas_lock_irq(xas);
xas_store(xas, entry);
xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 
trace_dax_writeback_one(mapping->host, index, count);
return ret;
-- 
2.25.4


[PATCH v5 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-28 Thread Vivek Goyal
Hi,

This is V5 of patches. Posted V4 here.

https://lore.kernel.org/linux-fsdevel/20210423130723.1673919-1-vgo...@redhat.com/

Changes since V4:

- Changed order of WAKE_NEXT and WAKE_ALL entries in enum. (Matthew Wilcox).

Thanks
Vivek

Vivek Goyal (3):
  dax: Add an enum for specifying dax wakeup mode
  dax: Add a wakeup mode parameter to put_unlocked_entry()
  dax: Wake up all waiters after invalidating dax entry

 fs/dax.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

-- 
2.25.4


Re: [PATCH v4 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-26 Thread Vivek Goyal
On Mon, Apr 26, 2021 at 07:02:11PM +0100, Matthew Wilcox wrote:
> On Mon, Apr 26, 2021 at 01:52:17PM -0400, Vivek Goyal wrote:
> > On Mon, Apr 26, 2021 at 02:46:32PM +0100, Matthew Wilcox wrote:
> > > On Fri, Apr 23, 2021 at 09:07:21AM -0400, Vivek Goyal wrote:
> > > > +enum dax_wake_mode {
> > > > +   WAKE_NEXT,
> > > > +   WAKE_ALL,
> > > > +};
> > > 
> > > Why define them in this order when ...
> > > 
> > > > @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, 
> > > > void *entry, bool wake_all)
> > > >  * must be in the waitqueue and the following check will see 
> > > > them.
> > > >  */
> > > > if (waitqueue_active(wq))
> > > > -   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
> > > > +   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, 
> > > > );
> > > 
> > > ... they're used like this?  This is almost as bad as
> > > 
> > > enum bool {
> > >   true,
> > >   false,
> > > };
> > 
> > Hi Matthew,
> > 
> > So you prefer that I should switch order of WAKE_NEXT and WAKE_ALL? 
> > 
> > enum dax_wake_mode {
> > WAKE_ALL,
> > WAKE_NEXT,
> > };
> 
> That, yes.
> 
> > And then do following to wake task.
> > 
> > if (waitqueue_active(wq))
> > __wake_up(wq, TASK_NORMAL, mode, );
> 
> No, the third argument to __wake_up() is a count, not an enum.  It just so
> happens that '0' means 'all' and we only ever wake up 1 and not, say, 5.
> So the logical way to define the enum is ALL, NEXT which _just happens
> to match_ the usage of __wake_up().

Ok, in that case, I will retain the existing code.

__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);

Vivek
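
To make the point above concrete, a minimal sketch with the v5+ ordering of
the enum (WAKE_ALL = 0, WAKE_NEXT = 1): the values happen to line up with
__wake_up()'s wake count (0 = all, 1 = one), but since that argument is a
count and not an enum, the explicit comparison is kept:

	enum dax_wake_mode {
		WAKE_ALL,	/* 0 -- happens to match the "wake all" count */
		WAKE_NEXT,	/* 1 -- happens to match the "wake one" count */
	};

	if (waitqueue_active(wq))
		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);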


Re: [PATCH v4 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-26 Thread Vivek Goyal
On Mon, Apr 26, 2021 at 02:46:32PM +0100, Matthew Wilcox wrote:
> On Fri, Apr 23, 2021 at 09:07:21AM -0400, Vivek Goyal wrote:
> > +enum dax_wake_mode {
> > +   WAKE_NEXT,
> > +   WAKE_ALL,
> > +};
> 
> Why define them in this order when ...
> 
> > @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
> > *entry, bool wake_all)
> >  * must be in the waitqueue and the following check will see them.
> >  */
> > if (waitqueue_active(wq))
> > -   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
> > +   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
> 
> ... they're used like this?  This is almost as bad as
> 
> enum bool {
>   true,
>   false,
> };

Hi Matthew,

So you prefer that I should switch order of WAKE_NEXT and WAKE_ALL? 

enum dax_wake_mode {
WAKE_ALL,
WAKE_NEXT,
};


And then do the following to wake the task.

if (waitqueue_active(wq))
	__wake_up(wq, TASK_NORMAL, mode, &key);

I am fine with this if you like this better.

Or are you suggesting that we don't introduce "enum dax_wake_mode" to
begin with?

Vivek


[PATCH v4 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-23 Thread Vivek Goyal
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only the next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it explicit
at the call site itself, which makes the code easier to read.

This patch should not introduce any change of behavior.

Reviewed-by: Greg Kurz 
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fdc6775..4b1918b9ad97 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+/**
+ * enum dax_wake_mode: waitqueue wakeup behaviour
+ * @WAKE_NEXT: wake only the first waiter in the waitqueue
+ * @WAKE_ALL: wake all waiters in the waitqueue
+ */
+enum dax_wake_mode {
+   WAKE_NEXT,
+   WAKE_ALL,
+};
+
 static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
void *entry, struct exceptional_entry_key *key)
 {
@@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
*wait,
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+  enum dax_wake_mode mode)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
@@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
*entry, bool wake_all)
 * must be in the waitqueue and the following check will see them.
 */
if (waitqueue_active(wq))
-   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
+   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
 }
 
 /*
@@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
old = xas_store(xas, entry);
xas_unlock_irq(xas);
BUG_ON(!dax_is_locked(old));
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrexceptional--;
entry = NULL;
xas_set(xas, index);
@@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
xas_lock_irq(xas);
xas_store(xas, entry);
xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 
trace_dax_writeback_one(mapping->host, index, count);
return ret;
-- 
2.25.4


[PATCH v4 3/3] dax: Wake up all waiters after invalidating dax entry

2021-04-23 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
    invalidate_exceptional_entry2()
      dax_invalidate_mapping_entry_sync()
        __dax_invalidate_entry() {
                xas_lock_irq(&xas);
                entry = get_unlocked_entry(&xas, 0);
                ...
                ...
                dax_disassociate_entry(entry, mapping, trunc);
                xas_store(&xas, NULL);
                ...
                ...
                put_unlocked_entry(&xas, entry);
                xas_unlock_irq(&xas);
        }

Say a fault is in progress and it has locked the entry at offset "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
the dax entry is locked, all three instances A, B, C will wait in the wait
queue.

When the dax fault finishes, say A is woken up. It will store a NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). Given
the current code, this means put_unlocked_entry() will not wake up the
next waiter, so C continues to wait and is never woken up.

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.

Reported-by: Sergio Lopez 
Fixes: ac401cc78242 ("dax: New fault locking")
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 96e896de8f18..83daa57d37d3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	mapping->nrexceptional--;
 	ret = 1;
 out:
-	put_unlocked_entry(&xas, entry, WAKE_NEXT);
+	put_unlocked_entry(&xas, entry, WAKE_ALL);
 	xas_unlock_irq(&xas);
 	return ret;
 }
-- 
2.25.4


[PATCH v4 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-23 Thread Vivek Goyal
As of now put_unlocked_entry() always wakes up the next waiter. In a later
patch we want to wake up all waiters at one call site. Hence, add a
wake-mode parameter to the function.

This patch does not introduce any change of behavior.

Reviewed-by: Greg Kurz 
Reviewed-by: Jan Kara 
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4b1918b9ad97..96e896de8f18 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct xa_state *xas, 
void *entry)
finish_wait(wq, );
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_wake_mode mode)
 {
-   /* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(struct 
address_space *mapping,
entry = get_unlocked_entry(, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
return ret;
 }
@@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, 
unsigned int order)
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);
-- 
2.25.4


[PATCH v4 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-23 Thread Vivek Goyal
Hi,

This is V4 of the patches. Posted V3 here.

https://lore.kernel.org/linux-fsdevel/20210419213636.1514816-1-vgo...@redhat.com/

Changes since V3 are.

- Renamed "enum dax_entry_wake_mode" to "enum dax_wake_mode" (Matthew Wilcox)
- Changed description of WAKE_NEXT and WAKE_ALL (Jan Kara) 
- Got rid of a comment (Greg Kurz)

Thanks
Vivek

Vivek Goyal (3):
  dax: Add an enum for specifying dax wakeup mode
  dax: Add a wakeup mode parameter to put_unlocked_entry()
  dax: Wake up all waiters after invalidating dax entry

 fs/dax.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-22 Thread Vivek Goyal
On Thu, Apr 22, 2021 at 01:01:15PM -0700, Dan Williams wrote:
> On Wed, Apr 21, 2021 at 11:25 PM Christoph Hellwig  wrote:
> >
> > On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote:
> > > Can you get in the habit of not replying inline with new patches like
> > > this? Collect the review feedback, take a pause, and resend the full
> > > series so tooling like b4 and patchwork can track when a new posting
> > > supersedes a previous one. As is, this inline style inflicts manual
> > > effort on the maintainer.
> >
> > Honestly I don't mind it at all.  If you shiny new tooling can't handle
> > it maybe you should fix your shiny new tooling instead of changing
> > everyones workflow?
> 
> I think asking a submitter to resend a series is par for the course,
> especially for poor saps like me burdened by corporate email systems.
> Vivek, if this is too onerous a request just give me a heads up and
> I'll manually pull out the patch content from your replies.

I am fine with posting a new version. Initially I thought that there
were only 1-2 minor cleanup comments, so I posted inline, thinking it
might be the preferred method instead of posting the full patch series again.

But then more comments came along. So posting another version makes
more sense now.

Thanks
Vivek


Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-21 Thread Vivek Goyal
On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote:
> On Tue, Apr 20, 2021 at 7:01 AM Vivek Goyal  wrote:
> >
> > On Tue, Apr 20, 2021 at 09:34:20AM +0200, Greg Kurz wrote:
> > > On Mon, 19 Apr 2021 17:36:35 -0400
> > > Vivek Goyal  wrote:
> > >
> > > > As of now put_unlocked_entry() always wakes up next waiter. In next
> > > > patches we want to wake up all waiters at one callsite. Hence, add a
> > > > parameter to the function.
> > > >
> > > > This patch does not introduce any change of behavior.
> > > >
> > > > Suggested-by: Dan Williams 
> > > > Signed-off-by: Vivek Goyal 
> > > > ---
> > > >  fs/dax.c | 13 +++--
> > > >  1 file changed, 7 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 00978d0838b1..f19d76a6a493 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state 
> > > > *xas, void *entry)
> > > > finish_wait(wq, );
> > > >  }
> > > >
> > > > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > > > +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> > > > +  enum dax_entry_wake_mode mode)
> > > >  {
> > > > /* If we were the only waiter woken, wake the next one */
> > >
> > > With this change, the comment is no longer accurate since the
> > > function can now wake all waiters if passed mode == WAKE_ALL.
> > > Also, it paraphrases the code which is simple enough, so I'd
> > > simply drop it.
> > >
> > > This is minor though and it shouldn't prevent this fix to go
> > > forward.
> > >
> > > Reviewed-by: Greg Kurz 
> >
> > Ok, here is the updated patch which drops that comment line.
> >
> > Vivek
> 
> Hi Vivek,
> 
> Can you get in the habit of not replying inline with new patches like
> this? Collect the review feedback, take a pause, and resend the full
> series so tooling like b4 and patchwork can track when a new posting
> supersedes a previous one. As is, this inline style inflicts manual
> effort on the maintainer.

Hi Dan,

Sure. I will avoid this updated-inline patch style. I will post a new
version of the patch series.

Thanks
Vivek

> 
> >
> > Subject: dax: Add a wakeup mode parameter to put_unlocked_entry()
> >
> > As of now put_unlocked_entry() always wakes up next waiter. In next
> > patches we want to wake up all waiters at one callsite. Hence, add a
> > parameter to the function.
> >
> > This patch does not introduce any change of behavior.
> >
> > Suggested-by: Dan Williams 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c |   14 +++---
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > Index: redhat-linux/fs/dax.c
> > ===
> > --- redhat-linux.orig/fs/dax.c  2021-04-20 09:55:45.105069893 -0400
> > +++ redhat-linux/fs/dax.c   2021-04-20 09:56:27.685822730 -0400
> > @@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct x
> > finish_wait(wq, );
> >  }
> >
> > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> > +  enum dax_entry_wake_mode mode)
> >  {
> > -   /* If we were the only waiter woken, wake the next one */
> > if (entry && !dax_is_conflict(entry))
> > -   dax_wake_entry(xas, entry, WAKE_NEXT);
> > +   dax_wake_entry(xas, entry, mode);
> >  }
> >
> >  /*
> > @@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(
> > entry = get_unlocked_entry(, 0);
> > if (entry)
> > page = dax_busy_page(entry);
> > -   put_unlocked_entry(, entry);
> > +   put_unlocked_entry(, entry, WAKE_NEXT);
> > if (page)
> > break;
> > if (++scanned % XA_CHECK_SCHED)
> > @@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct
> > mapping->nrexceptional--;
> > ret = 1;
> >  out:
> > -   put_unlocked_entry(, entry);
> > +   put_unlocked_ent

Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-21 Thread Vivek Goyal
On Tue, Apr 20, 2021 at 09:34:20AM +0200, Greg Kurz wrote:
> On Mon, 19 Apr 2021 17:36:35 -0400
> Vivek Goyal  wrote:
> 
> > As of now put_unlocked_entry() always wakes up next waiter. In next
> > patches we want to wake up all waiters at one callsite. Hence, add a
> > parameter to the function.
> > 
> > This patch does not introduce any change of behavior.
> > 
> > Suggested-by: Dan Williams 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c | 13 +++--
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 00978d0838b1..f19d76a6a493 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, 
> > void *entry)
> > finish_wait(wq, );
> >  }
> >  
> > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> > +  enum dax_entry_wake_mode mode)
> >  {
> > /* If we were the only waiter woken, wake the next one */
> 
> With this change, the comment is no longer accurate since the
> function can now wake all waiters if passed mode == WAKE_ALL.
> Also, it paraphrases the code which is simple enough, so I'd
> simply drop it.

Ok, I will get rid of this comment. Agreed that the code is simple
enough. And frankly speaking, I don't even understand the "If we were the
only waiter woken" part. How do we know that only this caller
was woken?

Vivek

> 
> This is minor though and it shouldn't prevent this fix to go
> forward.
> 
> Reviewed-by: Greg Kurz 
> 
> > if (entry && !dax_is_conflict(entry))
> > -   dax_wake_entry(xas, entry, WAKE_NEXT);
> > +   dax_wake_entry(xas, entry, mode);
> >  }
> >  
> >  /*
> > @@ -633,7 +634,7 @@ struct page *dax_layout_busy_page_range(struct 
> > address_space *mapping,
> > entry = get_unlocked_entry(, 0);
> > if (entry)
> > page = dax_busy_page(entry);
> > -   put_unlocked_entry(, entry);
> > +   put_unlocked_entry(, entry, WAKE_NEXT);
> > if (page)
> > break;
> > if (++scanned % XA_CHECK_SCHED)
> > @@ -675,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space 
> > *mapping,
> > mapping->nrexceptional--;
> > ret = 1;
> >  out:
> > -   put_unlocked_entry(, entry);
> > +   put_unlocked_entry(, entry, WAKE_NEXT);
> > xas_unlock_irq();
> > return ret;
> >  }
> > @@ -954,7 +955,7 @@ static int dax_writeback_one(struct xa_state *xas, 
> > struct dax_device *dax_dev,
> > return ret;
> >  
> >   put_unlocked:
> > -   put_unlocked_entry(xas, entry);
> > +   put_unlocked_entry(xas, entry, WAKE_NEXT);
> > return ret;
> >  }
> >  
> > @@ -1695,7 +1696,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t 
> > pfn, unsigned int order)
> > /* Did we race with someone splitting entry or so? */
> > if (!entry || dax_is_conflict(entry) ||
> > (order == 0 && !dax_is_pte_entry(entry))) {
> > -   put_unlocked_entry(, entry);
> > +   put_unlocked_entry(, entry, WAKE_NEXT);
> > xas_unlock_irq();
> > trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
> >   VM_FAULT_NOPAGE);
> 


Re: [PATCH v3 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-21 Thread Vivek Goyal
On Wed, Apr 21, 2021 at 05:16:24PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 21, 2021 at 11:56:31AM -0400, Vivek Goyal wrote:
> > +/**
> > + * enum dax_entry_wake_mode: waitqueue wakeup toggle
> 
> s/toggle/behaviour/ ?

Will do.

> 
> > + * @WAKE_NEXT: wake only the first waiter in the waitqueue
> > + * @WAKE_ALL: wake all waiters in the waitqueue
> > + */
> > +enum dax_entry_wake_mode {
> > +   WAKE_NEXT,
> > +   WAKE_ALL,
> > +};
> > +
> >  static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
> > void *entry, struct exceptional_entry_key *key)
> >  {
> > @@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(w
> >   * The important information it's conveying is whether the entry at
> >   * this index used to be a PMD entry.
> >   */
> > -static void dax_wake_entry(struct xa_state *xas, void *entry, bool 
> > wake_all)
> > +static void dax_wake_entry(struct xa_state *xas, void *entry,
> > +  enum dax_entry_wake_mode mode)
> 
> It's an awfully verbose name.  'dax_wake_mode'?

Sure. Will change.

Vivek
> 


Re: [PATCH v3 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-21 Thread Vivek Goyal
On Wed, Apr 21, 2021 at 11:24:40AM +0200, Jan Kara wrote:
> On Mon 19-04-21 17:36:34, Vivek Goyal wrote:
> > Dan mentioned that he is not very fond of passing around a boolean 
> > true/false
> > to specify if only next waiter should be woken up or all waiters should be
> > woken up. He instead prefers that we introduce an enum and make it very
> > explicity at the callsite itself. Easier to read code.
> > 
> > This patch should not introduce any change of behavior.
> > 
> > Suggested-by: Dan Williams 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c | 23 +--
> >  1 file changed, 17 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index b3d27fdc6775..00978d0838b1 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
> > struct exceptional_entry_key key;
> >  };
> >  
> > +/**
> > + * enum dax_entry_wake_mode: waitqueue wakeup toggle
> > + * @WAKE_NEXT: entry was not mutated
> > + * @WAKE_ALL: entry was invalidated, or resized
> 
> Let's document the constants in terms of what they do, not when they are
> expected to be called. So something like:
> 
> @WAKE_NEXT: wake only the first waiter in the waitqueue
> @WAKE_ALL: wake all waiters in the waitqueue
> 
> Otherwise the patch looks good so feel free to add:
> 
> Reviewed-by: Jan Kara 
> 

Hi Jan,

Here is the updated patch based on your feedback.

Thanks
Vivek


Subject: dax: Add an enum for specifying dax wakeup mode

Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only the next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it explicit
at the call site itself, which makes the code easier to read.

This patch should not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c |   23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-21 11:51:04.716289502 -0400
+++ redhat-linux/fs/dax.c   2021-04-21 11:52:10.298010850 -0400
@@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+/**
+ * enum dax_entry_wake_mode: waitqueue wakeup toggle
+ * @WAKE_NEXT: wake only the first waiter in the waitqueue
+ * @WAKE_ALL: wake all waiters in the waitqueue
+ */
+enum dax_entry_wake_mode {
+   WAKE_NEXT,
+   WAKE_ALL,
+};
+
 static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
void *entry, struct exceptional_entry_key *key)
 {
@@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(w
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
@@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_sta
 * must be in the waitqueue and the following check will see them.
 */
if (waitqueue_active(wq))
-   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
+   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
 }
 
 /*
@@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_s
old = xas_store(xas, entry);
xas_unlock_irq(xas);
BUG_ON(!dax_is_locked(old));
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -524,7 +535,7 @@ retry:
 
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrexceptional--;
entry = NULL;
xas_set(xas, index);
@@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_s
xas_lock_irq(xas);
xas_store(xas, entry);
xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 
trace_dax_writeback_one(mapping->host, index, count);
return ret;


Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-20 Thread Vivek Goyal
On Tue, Apr 20, 2021 at 09:34:20AM +0200, Greg Kurz wrote:
> On Mon, 19 Apr 2021 17:36:35 -0400
> Vivek Goyal  wrote:
> 
> > As of now put_unlocked_entry() always wakes up next waiter. In next
> > patches we want to wake up all waiters at one callsite. Hence, add a
> > parameter to the function.
> > 
> > This patch does not introduce any change of behavior.
> > 
> > Suggested-by: Dan Williams 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c | 13 +++--
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 00978d0838b1..f19d76a6a493 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, 
> > void *entry)
> > finish_wait(wq, );
> >  }
> >  
> > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> > +  enum dax_entry_wake_mode mode)
> >  {
> > /* If we were the only waiter woken, wake the next one */
> 
> With this change, the comment is no longer accurate since the
> function can now wake all waiters if passed mode == WAKE_ALL.
> Also, it paraphrases the code which is simple enough, so I'd
> simply drop it.
> 
> This is minor though and it shouldn't prevent this fix to go
> forward.
> 
> Reviewed-by: Greg Kurz 

Ok, here is the updated patch which drops that comment line.

Vivek

Subject: dax: Add a wakeup mode parameter to put_unlocked_entry()

As of now put_unlocked_entry() always wakes up the next waiter. In a later
patch we want to wake up all waiters at one call site. Hence, add a
wake-mode parameter to the function.

This patch does not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-20 09:55:45.105069893 -0400
+++ redhat-linux/fs/dax.c   2021-04-20 09:56:27.685822730 -0400
@@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct x
finish_wait(wq, );
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
-   /* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(
entry = get_unlocked_entry(, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
return ret;
 }
@@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_s
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);


[PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-19 Thread Vivek Goyal
As of now put_unlocked_entry() always wakes up the next waiter. In a later
patch we want to wake up all waiters at one call site. Hence, add a
wake-mode parameter to the function.

This patch does not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 00978d0838b1..f19d76a6a493 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, 
void *entry)
finish_wait(wq, );
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +634,7 @@ struct page *dax_layout_busy_page_range(struct 
address_space *mapping,
entry = get_unlocked_entry(, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
return ret;
 }
@@ -954,7 +955,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1696,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, 
unsigned int order)
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(, entry);
+   put_unlocked_entry(, entry, WAKE_NEXT);
xas_unlock_irq();
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);
-- 
2.25.4


[PATCH v3 1/3] dax: Add an enum for specifying dax wakeup mode

2021-04-19 Thread Vivek Goyal
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify if only the next waiter should be woken up or all waiters should be
woken up. He instead prefers that we introduce an enum and make it explicit
at the call site itself, which makes the code easier to read.

This patch should not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fdc6775..00978d0838b1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+/**
+ * enum dax_entry_wake_mode: waitqueue wakeup toggle
+ * @WAKE_NEXT: entry was not mutated
+ * @WAKE_ALL: entry was invalidated, or resized
+ */
+enum dax_entry_wake_mode {
+   WAKE_NEXT,
+   WAKE_ALL,
+};
+
 static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
void *entry, struct exceptional_entry_key *key)
 {
@@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
*wait,
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
@@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
*entry, bool wake_all)
 * must be in the waitqueue and the following check will see them.
 */
if (waitqueue_active(wq))
-   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, );
+   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, );
 }
 
 /*
@@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
old = xas_store(xas, entry);
xas_unlock_irq(xas);
BUG_ON(!dax_is_locked(old));
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrexceptional--;
entry = NULL;
xas_set(xas, index);
@@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
xas_lock_irq(xas);
xas_store(xas, entry);
xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 
trace_dax_writeback_one(mapping->host, index, count);
return ret;
-- 
2.25.4


[PATCH v3 3/3] dax: Wake up all waiters after invalidating dax entry

2021-04-19 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
invalidate_exceptional_entry2()
  dax_invalidate_mapping_entry_sync()
__dax_invalidate_entry() {
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, 0);
...
...
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
...
...
put_unlocked_entry(&xas, entry);
xas_unlock_irq(&xas);
}

Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given the
dax entry is locked, all three instances A, B, C will wait in the wait queue.

When dax fault finishes, say A is woken up. It will store NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up next waiter, given
the current code. And that means C continues to wait and is not woken
up.
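
To make the lost wakeup concrete, this is roughly what B ends up executing
once A has stored the NULL entry (a sketch of the code paths above, not an
exact trace):

    entry = get_unlocked_entry(&xas, 0);        /* returns NULL, entry already gone */
    ...
    put_unlocked_entry(&xas, entry);            /* entry == NULL */
        /* inside put_unlocked_entry(): */
        if (entry && !dax_is_conflict(entry))   /* false, so no wakeup is sent */
                dax_wake_entry(xas, entry, WAKE_NEXT);
    /* C is still in the wait queue and nobody will wake it */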

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.

Reported-by: Sergio Lopez 
Fixes: ac401cc78242 ("dax: New fault locking")
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index f19d76a6a493..cc497519be83 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -676,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(&xas, entry, WAKE_NEXT);
+   put_unlocked_entry(&xas, entry, WAKE_ALL);
xas_unlock_irq();
return ret;
 }
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-19 Thread Vivek Goyal
Hi,

This is V3 of the patches. V2 was posted here.

https://lore.kernel.org/linux-fsdevel/20210419184516.gc1472...@redhat.com/

Changes since v2:

- Broke down the patch into a patch series (Dan)
- Added an enum to communicate wake mode (Dan)

Thanks
Vivek

Vivek Goyal (3):
  dax: Add an enum for specifying dax wakeup mode
  dax: Add a wakeup mode parameter to put_unlocked_entry()
  dax: Wake up all waiters after invalidating dax entry

 fs/dax.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Vivek Goyal
On Mon, Apr 19, 2021 at 04:39:47PM -0400, Vivek Goyal wrote:
> On Mon, Apr 19, 2021 at 12:48:58PM -0700, Dan Williams wrote:
> > On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal  wrote:
> > >
> > > This is V2 of the patch. Posted V1 here.
> > >
> > > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/
> > >
> > > Based on feedback from Dan and Jan, modified the patch to wake up
> > > all waiters when dax entry is invalidated. This solves the issues
> > > of missed wakeups.
> > 
> > Care to send a formal patch with this commentary moved below the --- line?
> > 
> > One style fixup below...
> > 
> > >
> > > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > > using virtiofs with DAX enabled and running "make -j". I had to mount
> > > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
> > > the problem consistently.
> > >
> > > So here is the problem. put_unlocked_entry() wakes up waiters only
> > > if entry is not null as well as !dax_is_conflict(entry). But if I
> > > call multiple instances of invalidate_inode_pages2() in parallel,
> > > then I can run into a situation where there are waiters on
> > > this index but nobody will wait these.
> > >
> > > invalidate_inode_pages2()
> > >   invalidate_inode_pages2_range()
> > > invalidate_exceptional_entry2()
> > >   dax_invalidate_mapping_entry_sync()
> > > __dax_invalidate_entry() {
> > > xas_lock_irq();
> > > entry = get_unlocked_entry(, 0);
> > > ...
> > > ...
> > > dax_disassociate_entry(entry, mapping, trunc);
> > > xas_store(, NULL);
> > > ...
> > > ...
> > > put_unlocked_entry(, entry);
> > > xas_unlock_irq();
> > > }
> > >
> > > Say a fault in in progress and it has locked entry at offset say "0x1c".
> > > Now say three instances of invalidate_inode_pages2() are in progress
> > > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> > > dax entry is locked, all tree instances A, B, C will wait in wait queue.
> > >
> > > When dax fault finishes, say A is woken up. It will store NULL entry
> > > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > > at page offset 0x1c and it will call put_unlocked_entry(, 0). And
> > > this means put_unlocked_entry() will not wake up next waiter, given
> > > the current code. And that means C continues to wait and is not woken
> > > up.
> > >
> > > This patch fixes the issue by waking up all waiters when a dax entry
> > > has been invalidated. This seems to fix the deadlock I am facing
> > > and I can make forward progress.
> > >
> > > Reported-by: Sergio Lopez 
> > > Signed-off-by: Vivek Goyal 
> > > ---
> > >  fs/dax.c |   12 ++--
> > >  1 file changed, 6 insertions(+), 6 deletions(-)
> > >
> > > Index: redhat-linux/fs/dax.c
> > > ===
> > > --- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
> > > +++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
> > > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
> > > finish_wait(wq, );
> > >  }
> > >
> > > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool 
> > > wake_all)
> > >  {
> > > /* If we were the only waiter woken, wake the next one */
> > > if (entry && !dax_is_conflict(entry))
> > > -   dax_wake_entry(xas, entry, false);
> > > +   dax_wake_entry(xas, entry, wake_all);
> > >  }
> > >
> > >  /*
> > > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
> > > entry = get_unlocked_entry(, 0);
> > > if (entry)
> > > page = dax_busy_page(entry);
> > > -   put_unlocked_entry(, entry);
> > > +   put_unlocked_entry(, entry, false);
> > 
> > I'm not a fan of raw true/false arguments be

Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Vivek Goyal
On Mon, Apr 19, 2021 at 12:48:58PM -0700, Dan Williams wrote:
> On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal  wrote:
> >
> > This is V2 of the patch. Posted V1 here.
> >
> > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/
> >
> > Based on feedback from Dan and Jan, modified the patch to wake up
> > all waiters when dax entry is invalidated. This solves the issues
> > of missed wakeups.
> 
> Care to send a formal patch with this commentary moved below the --- line?
> 
> One style fixup below...
> 
> >
> > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > using virtiofs with DAX enabled and running "make -j". I had to mount
> > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
> > the problem consistently.
> >
> > So here is the problem. put_unlocked_entry() wakes up waiters only
> > if entry is not null as well as !dax_is_conflict(entry). But if I
> > call multiple instances of invalidate_inode_pages2() in parallel,
> > then I can run into a situation where there are waiters on
> > this index but nobody will wait these.
> >
> > invalidate_inode_pages2()
> >   invalidate_inode_pages2_range()
> > invalidate_exceptional_entry2()
> >   dax_invalidate_mapping_entry_sync()
> > __dax_invalidate_entry() {
> > xas_lock_irq();
> > entry = get_unlocked_entry(, 0);
> > ...
> > ...
> > dax_disassociate_entry(entry, mapping, trunc);
> > xas_store(, NULL);
> > ...
> > ...
> > put_unlocked_entry(, entry);
> > xas_unlock_irq();
> > }
> >
> > Say a fault in in progress and it has locked entry at offset say "0x1c".
> > Now say three instances of invalidate_inode_pages2() are in progress
> > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> > dax entry is locked, all tree instances A, B, C will wait in wait queue.
> >
> > When dax fault finishes, say A is woken up. It will store NULL entry
> > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > at page offset 0x1c and it will call put_unlocked_entry(, 0). And
> > this means put_unlocked_entry() will not wake up next waiter, given
> > the current code. And that means C continues to wait and is not woken
> > up.
> >
> > This patch fixes the issue by waking up all waiters when a dax entry
> > has been invalidated. This seems to fix the deadlock I am facing
> > and I can make forward progress.
> >
> > Reported-by: Sergio Lopez 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c |   12 ++--
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > Index: redhat-linux/fs/dax.c
> > ===
> > --- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
> > +++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
> > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
> > finish_wait(wq, );
> >  }
> >
> > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool 
> > wake_all)
> >  {
> > /* If we were the only waiter woken, wake the next one */
> > if (entry && !dax_is_conflict(entry))
> > -   dax_wake_entry(xas, entry, false);
> > +   dax_wake_entry(xas, entry, wake_all);
> >  }
> >
> >  /*
> > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
> > entry = get_unlocked_entry(, 0);
> > if (entry)
> > page = dax_busy_page(entry);
> > -   put_unlocked_entry(, entry);
> > +   put_unlocked_entry(, entry, false);
> 
> I'm not a fan of raw true/false arguments because if you read this
> line in isolation you need to go read put_unlocked_entry() to recall
> what that argument means. So lets add something like:
> 
> /**
>  * enum dax_entry_wake_mode: waitqueue wakeup toggle
>  * @WAKE_NEXT: entry was not mutated
>  * @WAKE_ALL: entry was invalidated, or resized
>  */
> enum dax_entry_wake_mode {
> WAKE_NEXT,
> WAKE_ALL,
> }
> 
> ...and use that as the arg for dax_wake_entry(). So I'd expect this to
> be a 3 patch series

[PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Vivek Goyal
This is V2 of the patch. Posted V1 here.

https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/

Based on feedback from Dan and Jan, I modified the patch to wake up
all waiters when a dax entry is invalidated. This solves the issue
of missed wakeups.

I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake these waiters.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
invalidate_exceptional_entry2()
  dax_invalidate_mapping_entry_sync()
__dax_invalidate_entry() {
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, 0);
...
...
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
...
...
put_unlocked_entry(&xas, entry);
xas_unlock_irq(&xas);
}

Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given the
dax entry is locked, all three instances A, B, C will wait in the wait queue.

When dax fault finishes, say A is woken up. It will store NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up next waiter, given
the current code. And that means C continues to wait and is not woken
up.

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.

Reported-by: Sergio Lopez 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
+++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
@@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
finish_wait(wq, &ewait.wait);
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry, bool 
wake_all)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, wake_all);
 }
 
 /*
@@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
entry = get_unlocked_entry(&xas, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, false);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -664,7 +664,7 @@ static int __dax_invalidate_entry(struct
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, true);
xas_unlock_irq(&xas);
return ret;
 }
@@ -943,7 +943,7 @@ static int dax_writeback_one(struct xa_s
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, false);
return ret;
 }
 
@@ -1684,7 +1684,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, false);
xas_unlock_irq(&xas);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-16 Thread Vivek Goyal
On Fri, Apr 16, 2021 at 12:56:05PM -0700, Dan Williams wrote:
> On Fri, Apr 16, 2021 at 10:35 AM Vivek Goyal  wrote:
> >
> > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > using virtiofs with DAX enabled and running "make -j". I had to mount
> > virtiofs as rootfs and also reduce to dax window size to 32M to reproduce
> > the problem consistently.
> >
> > This is not a complete patch. I am just proposing this partial fix to
> > highlight the issue and trying to figure out how it should be fixed.
> > Should it be fixed in generic dax code or should filesystem (fuse/virtiofs)
> > take care of this.
> >
> > So here is the problem. put_unlocked_entry() wakes up waiters only
> > if entry is not null as well as !dax_is_conflict(entry). But if I
> > call multiple instances of invalidate_inode_pages2() in parallel,
> > then I can run into a situation where there are waiters on
> > this index but nobody will wait these.
> >
> > invalidate_inode_pages2()
> >   invalidate_inode_pages2_range()
> > invalidate_exceptional_entry2()
> >   dax_invalidate_mapping_entry_sync()
> > __dax_invalidate_entry() {
> > xas_lock_irq();
> > entry = get_unlocked_entry(, 0);
> > ...
> > ...
> > dax_disassociate_entry(entry, mapping, trunc);
> > xas_store(, NULL);
> > ...
> > ...
> > put_unlocked_entry(, entry);
> > xas_unlock_irq();
> > }
> >
> > Say a fault in in progress and it has locked entry at offset say "0x1c".
> > Now say three instances of invalidate_inode_pages2() are in progress
> > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> > dax entry is locked, all tree instances A, B, C will wait in wait queue.
> >
> > When dax fault finishes, say A is woken up. It will store NULL entry
> > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > at page offset 0x1c and it will call put_unlocked_entry(, 0). And
> > this means put_unlocked_entry() will not wake up next waiter, given
> > the current code. And that means C continues to wait and is not woken
> > up.
> >
> > In my case I am seeing that dax page fault path itself is waiting
> > on grab_mapping_entry() and also invalidate_inode_page2() is
> > waiting in get_unlocked_entry() but entry has already been cleaned
> > up and nobody woke up these processes. Atleast I think that's what
> > is happening.
> >
> > This patch wakes up a process even if entry=0. And deadlock does not
> > happen. I am running into some OOM issues, that will debug.
> >
> > So my question is that is it a dax issue and should it be fixed in
> > dax layer. Or should it be handled in fuse to make sure that
> > multiple instances of invalidate_inode_pages2() on same inode
> > don't make progress in parallel and introduce enough locking
> > around it.
> >
> > Right now fuse_finish_open() calls invalidate_inode_pages2() without
> > any locking. That allows it to make progress in parallel to dax
> > fault path as well as allows multiple instances of invalidate_inode_pages2()
> > to run in parallel.
> >
> > Not-yet-signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c |7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > Index: redhat-linux/fs/dax.c
> > ===
> > --- redhat-linux.orig/fs/dax.c  2021-04-16 12:50:40.141363317 -0400
> > +++ redhat-linux/fs/dax.c   2021-04-16 12:51:42.385926390 -0400
> > @@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x
> >
> >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> >  {
> > -   /* If we were the only waiter woken, wake the next one */
> > -   if (entry && !dax_is_conflict(entry))
> > -   dax_wake_entry(xas, entry, false);
> > +   if (dax_is_conflict(entry))
> > +   return;
> > +
> > +   dax_wake_entry(xas, entry, false);
> 

Hi Dan,

> How does this work if entry is NULL? dax_entry_waitqueue() will not
> know if it needs to adjust the index.

Wake waiters both at the current index as well as at the PMD-adjusted index.
It feels a little ugly though.
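
Something like this is what I had in mind (rough, untested sketch; mangling
xas->xa_index like this is part of what feels ugly):

    static void put_unlocked_entry(struct xa_state *xas, void *entry)
    {
            if (dax_is_conflict(entry))
                    return;

            if (entry) {
                    dax_wake_entry(xas, entry, false);
                    return;
            }

            /* entry == NULL: no order information, so wake waiters hashed
             * at the current index and also at the PMD-aligned index.
             */
            dax_wake_entry(xas, entry, false);
            xas->xa_index &= ~PG_PMD_COLOUR;
            dax_wake_entry(xas, entry, false);
    }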

> I think the fix might be to
> specify that put_unlocked_entry() in the invalidate path needs to do a
> wake_up_all().

Doing a wake_up_all() when we invalidate an entry sounds good. I will give
it a try.

Thanks
Vivek
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-16 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 32M to reproduce
the problem consistently.

This is not a complete patch. I am just proposing this partial fix to
highlight the issue and to try to figure out how it should be fixed.
Should it be fixed in generic dax code, or should the filesystem (fuse/virtiofs)
take care of this?

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on 
this index but nobody will wake these waiters.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
invalidate_exceptional_entry2()
  dax_invalidate_mapping_entry_sync()
__dax_invalidate_entry() {
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, 0);
...
...
dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
...
...
put_unlocked_entry(&xas, entry);
xas_unlock_irq(&xas);
} 

Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given the
dax entry is locked, all three instances A, B, C will wait in the wait queue.

When dax fault finishes, say A is woken up. It will store NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up next waiter, given
the current code. And that means C continues to wait and is not woken
up.

In my case I am seeing that the dax page fault path itself is waiting
on grab_mapping_entry() and also invalidate_inode_pages2() is
waiting in get_unlocked_entry(), but the entry has already been cleaned
up and nobody woke up these processes. At least I think that's what
is happening.

This patch wakes up a process even if entry=0, and the deadlock does not
happen. I am running into some OOM issues, which I will debug.

So my question is: is this a dax issue that should be fixed in the
dax layer? Or should it be handled in fuse, by introducing enough locking
to make sure that multiple instances of invalidate_inode_pages2() on the
same inode don't make progress in parallel?

Right now fuse_finish_open() calls invalidate_inode_pages2() without
any locking. That allows it to make progress in parallel to dax
fault path as well as allows multiple instances of invalidate_inode_pages2()
to run in parallel.

Not-yet-signed-off-by: Vivek Goyal 
---
 fs/dax.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-16 12:50:40.141363317 -0400
+++ redhat-linux/fs/dax.c   2021-04-16 12:51:42.385926390 -0400
@@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x
 
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
-   /* If we were the only waiter woken, wake the next one */
-   if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   if (dax_is_conflict(entry))
+   return;
+
+   dax_wake_entry(xas, entry, false);
 }
 
 /*
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 00/18] virtiofs: Add DAX support

2020-08-28 Thread Vivek Goyal
On Fri, Aug 28, 2020 at 04:26:55PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal  wrote:
> >
> > Hi All,
> >
> > This is V3 of patches. I had posted version v2 version here.
> 
> Pushed to:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#dax
> 
> Fixed a couple of minor issues, and added two patches:
> 
> 1. move dax specific code from fuse core to a separate source file
> 
> 2. move dax specific data, as well as allowing dax to be configured out
> 
> I think it would be cleaner to fold these back into the original
> series, but for now I'm just asking for comments and testing.

Thanks Miklos. I will have a look and test.

Vivek
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-26 Thread Vivek Goyal
On Wed, Aug 26, 2020 at 09:26:29PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 26, 2020 at 9:17 PM Dr. David Alan Gilbert
>  wrote:
> 
> > Agreed, because there's not much that the server can do about it if the
> > client would like a smaller granularity - the servers granularity might
> > be dictated by it's mmap/pagesize/filesystem.  If the client wants a
> > larger granularity that's it's choice when it sends the setupmapping
> > calls.
> 
> What bothers me is that the server now comes with the built in 2MiB
> granularity (obviously much larger than actually needed).
> 
> What if at some point we'd want to reduce that somewhat in the client?
>   Yeah, we can't.   Maybe this is not a kernel problem after all, the
> proper thing would be to fix the server to actually send something
> meaningful.

Hi Miklos,

The current implementation of virtiofsd reports this map alignment based
on PAGE_SIZE:

/* This constraint comes from mmap(2) and munmap(2) */
outarg.map_alignment = ffsl(sysconf(_SC_PAGE_SIZE)) - 1;

Which should be 4K on x86. 

And that means that, if the client wants, it can drop to a dax mapping size
as small as 4K and still meet the alignment constraints. It is just that by
default we have chosen 2MB for now, fearing there might be too many small
mmap() calls on the host and we will hit various limits.
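
For example (illustrative only, not code from the patches), the only thing
the client has to guarantee is:

    /* hypothetical helper: a dax range size is usable if it is a
     * multiple of the alignment advertised in FUSE_INIT
     */
    static bool fuse_dax_sz_ok(uint32_t map_alignment, uint64_t range_sz)
    {
            return (range_sz % (1ULL << map_alignment)) == 0;
    }

With virtiofsd advertising map_alignment = 12 (4K pages), both 4K and 2MB
range sizes pass this check.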

Thanks
Vivek
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-26 Thread Vivek Goyal
On Wed, Aug 26, 2020 at 04:06:35PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal  wrote:
> >
> > The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
> > constraints via the FUST_INIT map_alignment field.  Parse this field and
> > ensure our DAX mappings meet the alignment constraints.
> >
> > We don't actually align anything differently since our mappings are
> > already 2MB aligned.  Just check the value when the connection is
> > established.  If it becomes necessary to honor arbitrary alignments in
> > the future we'll have to adjust how mappings are sized.
> >
> > The upshot of this commit is that we can be confident that mappings will
> > work even when emulating x86 on Power and similar combinations where the
> > host page sizes are different.
> >
> > Signed-off-by: Stefan Hajnoczi 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/fuse/fuse_i.h  |  5 -
> >  fs/fuse/inode.c   | 18 --
> >  include/uapi/linux/fuse.h |  4 +++-
> >  3 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 478c940b05b4..4a46e35222c7 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -47,7 +47,10 @@
> >  /** Number of dentries for each connection in the control filesystem */
> >  #define FUSE_CTL_NUM_DENTRIES 5
> >
> > -/* Default memory range size, 2MB */
> > +/*
> > + * Default memory range size.  A power of 2 so it agrees with common 
> > FUSE_INIT
> > + * map_alignment values 4KB and 64KB.
> > + */
> >  #define FUSE_DAX_SZ(2*1024*1024)
> >  #define FUSE_DAX_SHIFT (21)
> >  #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index b82eb61d63cc..947abdd776ca 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, 
> > struct fuse_args *args,
> >  {
> > struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> > struct fuse_init_out *arg = >out;
> > +   bool ok = true;
> >
> > if (error || arg->major != FUSE_KERNEL_VERSION)
> > -   fc->conn_error = 1;
> > +   ok = false;
> > else {
> > unsigned long ra_pages;
> >
> > @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, 
> > struct fuse_args *args,
> > min_t(unsigned int, 
> > FUSE_MAX_MAX_PAGES,
> > max_t(unsigned int, arg->max_pages, 
> > 1));
> > }
> > +   if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> > +   (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
> 
> This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
> 
> So the intention was that userspace can ask the kernel for a
> particular alignment, right?

My understanding is that the device will specify the alignment for
the foffset/moffset fields in fuse_setupmapping_in/fuse_removemapping_one.
And the DAX mapping can be any size meeting that alignment constraint.

> 
> In that case kernel can definitely succeed if the requested alignment
> is smaller than the kernel provided one, no? 

Yes. So if map_alignment is 64K and the DAX mapping size is 2MB, that's just
fine because it meets the 64K alignment constraint. It is just that we can't
use a 4K DAX mapping size in that case.

> It would also make
> sense to make this a two way negotiation.  I.e. send the largest
> alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> provide in fuse_init_in.   In that case the only error would be if
> userspace ignored the given constraints.

We could make it a two-way negotiation if it helps. Say we support
multiple mapping sizes in the future: 4K, 64K, 2MB, 1GB. Is the idea
to send the alignment of the largest mapping size (1GB in this case)
to the device/userspace? That would allow the device to choose an
alignment which best fits its needs.

But the problem here is that sending log2(1GB) does not mean we support
all the alignments in that range. For example, if the device selects,
say, 256MB as the minimum alignment, the kernel might not support it.

So there seem to be two ways to handle this.

A. Let the device be conservative and always specify the minimum alignment
   it can work with, and let the guest kernel automatically choose a mapping
   size which meets that min_alignment constraint.

B. Send all the mapping sizes supported by the kernel 

Re: [PATCH v3 02/18] dax: Create a range version of dax_layout_busy_page()

2020-08-20 Thread Vivek Goyal
On Thu, Aug 20, 2020 at 02:58:55PM +0200, Jan Kara wrote:
[..]
> >  /**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > + * dax_layout_busy_page_range - find first pinned page in @mapping
> >   * @mapping: address space to scan for a page with ref count > 1
> 
> Please document additional function arguments in the kernel-doc comment.
> 
> Otherwise the patch looks good so feel free to add:
> 
> Reviewed-by: Jan Kara 
> 
> after fixing this nit.
> 

Hi Jan

Thanks for the review. Here is the updated patch. I also captured your
Reviewed-by.


>From 3f81f769be9419ffc5a788833339ed439dbcd48e Mon Sep 17 00:00:00 2001
From: Vivek Goyal 
Date: Tue, 3 Mar 2020 14:58:21 -0500
Subject: [PATCH 02/20] dax: Create a range version of dax_layout_busy_page()

virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.

As part of the reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in its current form this is per inode
and scans through all the pages of the inode.

We want to reclaim only a portion of memory (say a 2MB range). So we want
to make sure that only that 2MB range of pages does not have any
references (and we don't want to unmap all the pages of the inode).

Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.
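
A hypothetical caller in the reclaim path would then look roughly like this
(illustration only; the real user shows up in the virtiofs reclaim patches,
which also define FUSE_DAX_SZ):

    struct page *page;

    /* check only the 2MB range we want to reclaim for pinned pages */
    page = dax_layout_busy_page_range(inode->i_mapping, start,
                                      start + FUSE_DAX_SZ - 1);
    if (page)
            return -EAGAIN;  /* somebody still holds a reference */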

Cc: Dan Williams 
Cc: linux-nvdimm@lists.01.org
Cc: Jan Kara 
Cc: Vishal L Verma 
Cc: "Weiny, Ira" 
Signed-off-by: Vivek Goyal 
Reviewed-by: Jan Kara 
---
 fs/dax.c|   29 +++--
 include/linux/dax.h |6 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2020-08-20 14:04:41.995676669 +
+++ redhat-linux/fs/dax.c   2020-08-20 14:15:20.072676669 +
@@ -559,8 +559,11 @@ fallback:
 }
 
 /**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page_range - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
+ * @start: Starting offset. Page containing 'start' is included.
+ * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
+ *   pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
@@ -573,12 +576,15 @@ fallback:
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+   loff_t start, loff_t end)
 {
-   XA_STATE(xas, >i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+   pgoff_t start_idx = start >> PAGE_SHIFT;
+   pgoff_t end_idx;
+   XA_STATE(xas, >i_pages, start_idx);
 
/*
 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
 
+   /* If end == LLONG_MAX, all pages from start to till end of file */
+   if (end == LLONG_MAX)
+   end_idx = ULONG_MAX;
+   else
+   end_idx = end >> PAGE_SHIFT;
/*
 * If we race get_user_pages_fast() here either we'll see the
 * elevated page count in the iteration and wait, or
@@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct
 * against is no longer mapped in the page tables and bail to the
 * get_user_pages() slow path.  The slow path is protected by
 * pte_lock() and pmd_lock(). New references are not taken without
-* holding those locks, and unmap_mapping_range() will not zero the
+* holding those locks, and unmap_mapping_pages() will not zero the
 * pte or pmd without holding the respective lock, so we are
 * guaranteed to either see new references or prevent new
 * references from being established.
 */
-   unmap_mapping_range(mapping, 0, 0, 0);
+   unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
 
xas_lock_irq();
-   xas_for_each(, entry, ULONG_MAX) {
+   

[PATCH v3 16/18] fuse, dax: Serialize truncate/punch_hole and dax fault path

2020-08-19 Thread Vivek Goyal
Currently in fuse we don't seem to have any lock which can serialize the fault
path with the truncate/punch_hole path. With dax support I need one for the
following reasons.

1. Dax requirement

  The DAX fault code relies on the inode size being stable for the duration of
  the fault and wants to serialize with truncate/punch_hole, and this is
  explicitly mentioned in the code:

  static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
   const struct iomap_ops *ops)
/*
 * Check whether offset isn't beyond end of file now. Caller is
 * supposed to hold locks serializing us with truncate / punch hole so
 * this is a reliable test.
 */
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

2. Make sure there are no users of pages being truncated/punch_hole

  get_user_pages() might take references to a page and then do some DMA
  to said pages. The filesystem might truncate those pages without knowing
  that DMA or some other I/O is in progress. So use
  dax_layout_busy_page() to make sure there are no such references
  and no I/O is in progress on said pages before moving ahead with
  truncation.

3. Limitation of kvm page fault error reporting

  If we truncate the file on the host first and then remove mappings in
  the guest later (truncate page cache etc.), this could lead to a
  problem with KVM. Say a mapping is in place in the guest and truncation
  happens on the host. Now if the guest accesses that mapping, the host will
  take a fault and kvm will either exit to qemu or spin infinitely.

  IOW, before we do truncation on the host, we need to make sure that the
  guest inode does not have any mapping in that region or in the whole file.

4. virtiofs memory range reclaim

 Soon I will introduce the notion of being able to reclaim dax memory
 ranges from a fuse dax inode. There also I need to make sure that
 no I/O or fault is going on in the reclaimed range and nobody is using
 it so that range can be reclaimed without issues.

Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose.  It can be used to serialize with faults.

As of now, I am taking this semaphore only in the dax fault path and
not the regular fault path, because the existing code does not have one. Maybe
the existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focusing
on the DAX path, which is the new path.

Also added logic to take fuse_inode->i_mmap_sem in the
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
the fuse dax fault path are mutually exclusive, avoiding all the above problems.
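
Roughly, the intended ordering is (simplified sketch of the hunks below,
not the literal code):

    /* truncate / punch_hole / open(O_TRUNC) path: */
    down_write(&fi->i_mmap_sem);
    fuse_break_dax_layouts(inode, 0, 0);    /* wait for pinned dax pages */
    /* ... do the truncation ... */
    up_write(&fi->i_mmap_sem);

    /* dax fault path: */
    down_read(&fi->i_mmap_sem);
    ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
    up_read(&fi->i_mmap_sem);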

Signed-off-by: Vivek Goyal 
Cc: Dave Chinner 
---
 fs/fuse/dir.c| 32 ++-
 fs/fuse/file.c   | 81 +---
 fs/fuse/fuse_i.h |  9 ++
 fs/fuse/inode.c  |  1 +
 4 files changed, 112 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..4c7e29ba7c4c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1501,6 +1501,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr 
*attr,
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb && S_ISREG(inode->i_mode);
+   bool fault_blocked = false;
 
if (!fc->default_permissions)
attr->ia_valid |= ATTR_FORCE;
@@ -1509,6 +1510,22 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr 
*attr,
if (err)
return err;
 
+   if (attr->ia_valid & ATTR_SIZE) {
+   if (WARN_ON(!S_ISREG(inode->i_mode)))
+   return -EIO;
+   is_truncate = true;
+   }
+
+   if (IS_DAX(inode) && is_truncate) {
+   down_write(>i_mmap_sem);
+   fault_blocked = true;
+   err = fuse_break_dax_layouts(inode, 0, 0);
+   if (err) {
+   up_write(>i_mmap_sem);
+   return err;
+   }
+   }
+
if (attr->ia_valid & ATTR_OPEN) {
/* This is coming from open(..., ... | O_TRUNC); */
WARN_ON(!(attr->ia_valid & ATTR_SIZE));
@@ -1521,17 +1538,11 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr 
*attr,
 */
i_size_write(inode, 0);
truncate_pagecache(inode, 0);
-   return 0;
+   goto out;
}
file = NULL;
}
 
-   if (attr->ia_valid & ATTR_SIZE) {
-   if (WARN_ON(!S_ISREG(inode->i_mode)))
-   return -EIO;
-   is_truncate = true;
-   }
-
/* Flush dirty data/metadata before non-truncate SETATTR */
if (is_wb && S_ISREG(inode->i_mode) &&
attr-

[PATCH v3 10/18] fuse,virtiofs: Keep a list of free dax memory ranges

2020-08-19 Thread Vivek Goyal
Divide the dax memory range into fixed-size ranges (2MB for now) and put
them in a list. This list tracks the free ranges. Once an inode requires a
free range, we take one from here and put it in the interval tree
of ranges assigned to the inode.

Signed-off-by: Vivek Goyal 
Signed-off-by: Peng Tao 
---
 fs/fuse/fuse_i.h| 23 
 fs/fuse/inode.c | 88 -
 fs/fuse/virtio_fs.c |  2 ++
 3 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 04fdd7c41bd1..478c940b05b4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,11 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
+/* Default memory range size, 2MB */
+#define FUSE_DAX_SZ(2*1024*1024)
+#define FUSE_DAX_SHIFT (21)
+#define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -63,6 +68,18 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
 };
 
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+   /* Will connect in fc->free_ranges to keep track of free memory */
+   struct list_head list;
+
+   /** Position in DAX window */
+   u64 window_offset;
+
+   /** Length of mapping, in bytes */
+   loff_t length;
+};
+
 /** FUSE inode */
 struct fuse_inode {
/** Inode data */
@@ -768,6 +785,12 @@ struct fuse_conn {
 
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+
+   /*
+* DAX Window Free Ranges
+*/
+   long nr_free_ranges;
+   struct list_head free_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index beac337ccc10..b82eb61d63cc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 MODULE_AUTHOR("Miklos Szeredi ");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -620,6 +622,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
 }
 
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+   struct fuse_dax_mapping *range, *temp;
+
+   /* Free All allocated elements */
+   list_for_each_entry_safe(range, temp, mem_list, list) {
+   list_del(>list);
+   kfree(range);
+   }
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+  struct dax_device *dax_dev)
+{
+   long nr_pages, nr_ranges;
+   void *kaddr;
+   pfn_t pfn;
+   struct fuse_dax_mapping *range;
+   LIST_HEAD(mem_ranges);
+   phys_addr_t phys_addr;
+   int ret = 0, id;
+   size_t dax_size = -1;
+   unsigned long i;
+
+   id = dax_read_lock();
+   nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), ,
+   );
+   dax_read_unlock(id);
+   if (nr_pages < 0) {
+   pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+   return nr_pages;
+   }
+
+   phys_addr = pfn_t_to_phys(pfn);
+   nr_ranges = nr_pages/FUSE_DAX_PAGES;
+   printk("fuse_dax_mem_range_init(): dax mapped %ld pages. 
nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+   for (i = 0; i < nr_ranges; i++) {
+   range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+   if (!range) {
+   pr_debug("memory allocation for mem_range failed.\n");
+   ret = -ENOMEM;
+   goto out_err;
+   }
+   /* TODO: This offset only works if virtio-fs driver is not
+* having some memory hidden at the beginning. This needs
+* better handling
+*/
+   range->window_offset = i * FUSE_DAX_SZ;
+   range->length = FUSE_DAX_SZ;
+   list_add_tail(>list, _ranges);
+   }
+
+   list_replace_init(_ranges, >free_ranges);
+   fc->nr_free_ranges = nr_ranges;
+   return 0;
+out_err:
+   /* Free All allocated elements */
+   fuse_free_dax_mem_ranges(_ranges);
+   return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+ struct dax_device *dax_dev)
+{
+   return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
@@ -647,6 +719,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
fc->

[PATCH v3 14/18] fuse,dax: add DAX mmap support

2020-08-19 Thread Vivek Goyal
From: Stefan Hajnoczi 

Add DAX mmap() support.

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/file.c | 62 +-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 99457d0b14b9..f1ad8b95b546 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2841,10 +2841,15 @@ static const struct vm_operations_struct 
fuse_file_vm_ops = {
.page_mkwrite   = fuse_page_mkwrite,
 };
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
struct fuse_file *ff = file->private_data;
 
+   /* DAX mmap is superior to direct_io mmap */
+   if (IS_DAX(file_inode(file)))
+   return fuse_dax_mmap(file, vma);
+
if (ff->open_flags & FOPEN_DIRECT_IO) {
/* Can't provide the coherency needed for MAP_SHARED */
if (vma->vm_flags & VM_MAYSHARE)
@@ -2863,9 +2868,63 @@ static int fuse_file_mmap(struct file *file, struct 
vm_area_struct *vma)
return 0;
 }
 
+static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
+  enum page_entry_size pe_size, bool write)
+{
+   vm_fault_t ret;
+   struct inode *inode = file_inode(vmf->vma->vm_file);
+   struct super_block *sb = inode->i_sb;
+   pfn_t pfn;
+
+   if (write)
+   sb_start_pagefault(sb);
+
+   ret = dax_iomap_fault(vmf, pe_size, , NULL, _iomap_ops);
+
+   if (ret & VM_FAULT_NEEDDSYNC)
+   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+   if (write)
+   sb_end_pagefault(sb);
+
+   return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+   vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+  enum page_entry_size pe_size)
+{
+   return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+   .fault  = fuse_dax_fault,
+   .huge_fault = fuse_dax_huge_fault,
+   .page_mkwrite   = fuse_dax_page_mkwrite,
+   .pfn_mkwrite= fuse_dax_pfn_mkwrite,
+};
+
 static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-   return -EINVAL; /* TODO */
+   file_accessed(file);
+   vma->vm_ops = _dax_vm_ops;
+   vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+   return 0;
 }
 
 static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3938,6 +3997,7 @@ static const struct file_operations fuse_file_operations 
= {
.release= fuse_release,
.fsync  = fuse_fsync,
.lock   = fuse_file_lock,
+   .get_unmapped_area = thp_get_unmapped_area,
.flock  = fuse_file_flock,
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 06/18] virtiofs: Provide a helper function for virtqueue initialization

2020-08-19 Thread Vivek Goyal
This reduces code duplication and makes the code a little easier to read.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/virtio_fs.c | 50 +++--
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 104f35de5270..ed8da4825b70 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -24,6 +24,8 @@ enum {
VQ_REQUEST
 };
 
+#define VQ_NAME_LEN24
+
 /* Per-virtqueue state */
 struct virtio_fs_vq {
spinlock_t lock;
@@ -36,7 +38,7 @@ struct virtio_fs_vq {
bool connected;
long in_flight;
struct completion in_flight_zero; /* No inflight requests */
-   char name[24];
+   char name[VQ_NAME_LEN];
 } cacheline_aligned_in_smp;
 
 /* A virtio-fs device instance */
@@ -596,6 +598,26 @@ static void virtio_fs_vq_done(struct virtqueue *vq)
schedule_work(>done_work);
 }
 
+static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
+ int vq_type)
+{
+   strncpy(fsvq->name, name, VQ_NAME_LEN);
+   spin_lock_init(>lock);
+   INIT_LIST_HEAD(>queued_reqs);
+   INIT_LIST_HEAD(>end_reqs);
+   init_completion(>in_flight_zero);
+
+   if (vq_type == VQ_REQUEST) {
+   INIT_WORK(>done_work, virtio_fs_requests_done_work);
+   INIT_DELAYED_WORK(>dispatch_work,
+ virtio_fs_request_dispatch_work);
+   } else {
+   INIT_WORK(>done_work, virtio_fs_hiprio_done_work);
+   INIT_DELAYED_WORK(>dispatch_work,
+ virtio_fs_hiprio_dispatch_work);
+   }
+}
+
 /* Initialize virtqueues */
 static int virtio_fs_setup_vqs(struct virtio_device *vdev,
   struct virtio_fs *fs)
@@ -611,7 +633,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
if (fs->num_request_queues == 0)
return -EINVAL;
 
-   fs->nvqs = 1 + fs->num_request_queues;
+   fs->nvqs = VQ_REQUEST + fs->num_request_queues;
fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
if (!fs->vqs)
return -ENOMEM;
@@ -625,29 +647,17 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
goto out;
}
 
+   /* Initialize the hiprio/forget request virtqueue */
callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
-   snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
-   "hiprio");
+   virtio_fs_init_vq(>vqs[VQ_HIPRIO], "hiprio", VQ_HIPRIO);
names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
-   INIT_WORK(>vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
-   INIT_LIST_HEAD(>vqs[VQ_HIPRIO].queued_reqs);
-   INIT_LIST_HEAD(>vqs[VQ_HIPRIO].end_reqs);
-   INIT_DELAYED_WORK(>vqs[VQ_HIPRIO].dispatch_work,
-   virtio_fs_hiprio_dispatch_work);
-   init_completion(>vqs[VQ_HIPRIO].in_flight_zero);
-   spin_lock_init(>vqs[VQ_HIPRIO].lock);
 
/* Initialize the requests virtqueues */
for (i = VQ_REQUEST; i < fs->nvqs; i++) {
-   spin_lock_init(>vqs[i].lock);
-   INIT_WORK(>vqs[i].done_work, virtio_fs_requests_done_work);
-   INIT_DELAYED_WORK(>vqs[i].dispatch_work,
- virtio_fs_request_dispatch_work);
-   INIT_LIST_HEAD(>vqs[i].queued_reqs);
-   INIT_LIST_HEAD(>vqs[i].end_reqs);
-   init_completion(>vqs[i].in_flight_zero);
-   snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
-"requests.%u", i - VQ_REQUEST);
+   char vq_name[VQ_NAME_LEN];
+
+   snprintf(vq_name, VQ_NAME_LEN, "requests.%u", i - VQ_REQUEST);
+   virtio_fs_init_vq(>vqs[i], vq_name, VQ_REQUEST);
callbacks[i] = virtio_fs_vq_done;
names[i] = fs->vqs[i].name;
}
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 12/18] fuse: Introduce setupmapping/removemapping commands

2020-08-19 Thread Vivek Goyal
Introduce two new fuse commands to set up/remove memory mappings. These
will be used to set up/tear down file mappings in the dax window.

Signed-off-by: Vivek Goyal 
Signed-off-by: Peng Tao 
---
 include/uapi/linux/fuse.h | 29 +
 1 file changed, 29 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5b85819e045f..60a7bfc787ce 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -894,4 +894,33 @@ struct fuse_copy_file_range_in {
uint64_tflags;
 };
 
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+   /* An already open handle */
+   uint64_tfh;
+   /* Offset into the file to start the mapping */
+   uint64_tfoffset;
+   /* Length of mapping required */
+   uint64_tlen;
+   /* Flags, FUSE_SETUPMAPPING_FLAG_* */
+   uint64_tflags;
+   /* Offset in Memory Window */
+   uint64_tmoffset;
+};
+
+struct fuse_removemapping_in {
+   /* number of fuse_removemapping_one follows */
+   uint32_tcount;
+};
+
+struct fuse_removemapping_one {
+   /* Offset into the dax window start the unmapping */
+   uint64_tmoffset;
+   /* Length of mapping required */
+   uint64_tlen;
+};
+
+#define FUSE_REMOVEMAPPING_MAX_ENTRY   \
+   (PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
 #endif /* _LINUX_FUSE_H */
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 00/18] virtiofs: Add DAX support

2020-08-19 Thread Vivek Goyal
Hi All,

This is V3 of the patches. I had posted v2 here.

https://lore.kernel.org/linux-fsdevel/20200807195526.426056-1-vgo...@redhat.com/

I have taken care of the comments on V2. Changes since V2:

- Rebased patches on top of 5.9-rc1

- Renamed couple of functions to get rid of iomap prefix. (Dave Chinner)

- Modified truncate/punch_hole paths to serialize with the dax fault
  path. For now I did this only for dax paths. Maybe the non-dax path
  can benefit from this too, but that is an option for a different
  day. (Dave Chinner)

- Took care of comments by Jan Kara in dax_layout_busy_page_range()
  implementation patch.

- Dropped one of the patches which forced sync release in
  fuse_file_put() path for DAX files. It was redundant now as virtiofs
  already sets fs_context->destroy which forces sync release. (Miklos)

- Took care of some of the errors flagged by checkpatch.pl.

Description from previous post
--

This patch series adds DAX support to virtiofs filesystem. This allows
bypassing guest page cache and allows mapping host page cache directly
in guest address space.

When a page of a file is needed, the guest sends a request to map that page
(in the host page cache) into the qemu address space. Inside the guest this is
a physical memory range controlled by the virtiofs device. The guest
directly maps this physical address range using DAX and hence gets
access to the file data on the host.

This can speed things up considerably in many situations. It can also
result in substantial memory savings, as file data does not have
to be copied into the guest and is accessed directly from the host page
cache.

Most of the changes are limited to fuse/virtiofs. There are couple
of changes needed in generic dax infrastructure and couple of changes
in virtio to be able to access shared memory region.

Thanks
Vivek

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (2):
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse,dax: add DAX mmap support

Vivek Goyal (13):
  dax: Modify bdev_dax_pgoff() to handle NULL bdev
  dax: Create a range version of dax_layout_busy_page()
  virtiofs: Provide a helper function for virtqueue initialization
  fuse: Get rid of no_mount_options
  fuse,virtiofs: Add a mount option to enable dax
  fuse,virtiofs: Keep a list of free dax memory ranges
  fuse: implement FUSE_INIT map_alignment field
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse,virtiofs: Define dax address space operations
  fuse, dax: Serialize truncate/punch_hole and dax fault path
  fuse,virtiofs: Maintain a list of busy elements
  fuse,virtiofs: Add logic to free up a memory range

 drivers/dax/super.c|3 +-
 drivers/virtio/virtio_mmio.c   |   31 +
 drivers/virtio/virtio_pci_modern.c |   95 +++
 fs/dax.c   |   29 +-
 fs/fuse/dir.c  |   32 +-
 fs/fuse/file.c | 1198 +++-
 fs/fuse/fuse_i.h   |  114 ++-
 fs/fuse/inode.c|  146 +++-
 fs/fuse/virtio_fs.c|  279 ++-
 include/linux/dax.h|6 +
 include/linux/virtio_config.h  |   17 +
 include/uapi/linux/fuse.h  |   34 +-
 include/uapi/linux/virtio_fs.h |3 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h|   11 +-
 15 files changed, 1933 insertions(+), 76 deletions(-)

Cc: Jan Kara 
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: Ira Weiny 
Cc: "Michael S. Tsirkin" 
Cc: Vishal L Verma 
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 13/18] fuse, dax: Implement dax read/write operations

2020-08-19 Thread Vivek Goyal
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch focuses on implementing
read/write.

We make use of an interval tree to keep track of per-inode dax mappings.

Do not use dax for file-extending writes; instead just send a WRITE message
to the daemon (like we do for the direct I/O path). This keeps the write and
the i_size change atomic w.r.t. crash.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
Signed-off-by: Miklos Szeredi 
Signed-off-by: Liu Bo 
Signed-off-by: Peng Tao 
Cc: Dave Chinner 
---
 fs/fuse/file.c| 550 +-
 fs/fuse/fuse_i.h  |  26 ++
 fs/fuse/inode.c   |   6 +
 include/uapi/linux/fuse.h |   1 +
 4 files changed, 577 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6611ef3269a8..99457d0b14b9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -19,6 +19,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
  struct fuse_page_desc **desc)
@@ -188,6 +191,228 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(>lock);
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+   struct fuse_dax_mapping *dmap = NULL;
+
+   spin_lock(>lock);
+
+   if (fc->nr_free_ranges <= 0) {
+   spin_unlock(>lock);
+   return NULL;
+   }
+
+   WARN_ON(list_empty(>free_ranges));
+
+   /* Take a free range */
+   dmap = list_first_entry(>free_ranges, struct fuse_dax_mapping,
+   list);
+   list_del_init(>list);
+   fc->nr_free_ranges--;
+   spin_unlock(>lock);
+   return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __dmap_add_to_free_pool(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_add_tail(>list, >free_ranges);
+   fc->nr_free_ranges++;
+}
+
+static void dmap_add_to_free_pool(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   /* Return fuse_dax_mapping to free list */
+   spin_lock(>lock);
+   __dmap_add_to_free_pool(fc, dmap);
+   spin_unlock(>lock);
+}
+
+static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
+ struct fuse_dax_mapping *dmap, bool writable,
+ bool upgrade)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_setupmapping_in inarg;
+   loff_t offset = start_idx << FUSE_DAX_SHIFT;
+   FUSE_ARGS(args);
+   ssize_t err;
+
+   WARN_ON(fc->nr_free_ranges < 0);
+
+   /* Ask fuse daemon to setup mapping */
+   memset(&inarg, 0, sizeof(inarg));
+   inarg.foffset = offset;
+   inarg.fh = -1;
+   inarg.moffset = dmap->window_offset;
+   inarg.len = FUSE_DAX_SZ;
+   inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+   if (writable)
+   inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+   args.opcode = FUSE_SETUPMAPPING;
+   args.nodeid = fi->nodeid;
+   args.in_numargs = 1;
+   args.in_args[0].size = sizeof(inarg);
+   args.in_args[0].value = &inarg;
+   err = fuse_simple_request(fc, &args);
+   if (err < 0)
+   return err;
+   dmap->writable = writable;
+   if (!upgrade) {
+   dmap->itn.start = dmap->itn.last = start_idx;
+   /* Protected by fi->i_dmap_sem */
+   interval_tree_insert(&dmap->itn, &fi->dmap_tree);
+   fi->nr_dmaps++;
+   }
+   return 0;
+}
+
+static int
+fuse_send_removemapping(struct inode *inode,
+   struct fuse_removemapping_in *inargp,
+   struct fuse_removemapping_one *remove_one)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   FUSE_ARGS(args);
+
+   args.opcode = FUSE_REMOVEMAPPING;
+   args.nodeid = fi->nodeid;
+   args.in_numargs = 2;
+   args.in_args[0].size = sizeof(*inargp);
+   args.in_args[0].value = inargp;
+   args.in_args[1].size = inargp->count * sizeof(*remove_one);
+   args.in_args[1].value = remove_one;
+   return fuse_simple_request(fc, &args);
+}
+
+static int dmap_removemapping_list(struct inode *inode, unsigned num,
+  struct list_head *to_remove)
+{
+   struct fuse_removemapping_one *remove_one, *ptr;
+   struct fuse_removemapping_in inarg;
+   struct fuse_dax_mapping *dmap;
+   int ret, i = 0, nr_alloc;
+
+   nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
+   remove_one = kmalloc

[PATCH v3 15/18] fuse,virtiofs: Define dax address space operations

2020-08-19 Thread Vivek Goyal
This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need to be
written back.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f1ad8b95b546..0eecb4097c14 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2669,6 +2669,16 @@ static int fuse_writepages_fill(struct page *page,
return err;
 }
 
+static int fuse_dax_writepages(struct address_space *mapping,
+   struct writeback_control *wbc)
+{
+
+   struct inode *inode = mapping->host;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   return dax_writeback_mapping_range(mapping, fc->dax_dev, wbc);
+}
+
 static int fuse_writepages(struct address_space *mapping,
   struct writeback_control *wbc)
 {
@@ -4021,6 +4031,13 @@ static const struct address_space_operations 
fuse_file_aops  = {
.write_end  = fuse_write_end,
 };
 
+static const struct address_space_operations fuse_dax_file_aops  = {
+   .writepages = fuse_dax_writepages,
+   .direct_IO  = noop_direct_IO,
+   .set_page_dirty = noop_set_page_dirty,
+   .invalidatepage = noop_invalidatepage,
+};
+
 void fuse_init_file_inode(struct inode *inode)
 {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -4036,6 +4053,8 @@ void fuse_init_file_inode(struct inode *inode)
fi->writepages = RB_ROOT;
fi->dmap_tree = RB_ROOT_CACHED;
 
-   if (fc->dax_dev)
+   if (fc->dax_dev) {
inode->i_flags |= S_DAX;
+   inode->i_data.a_ops = &fuse_dax_file_aops;
+   }
 }
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 08/18] virtio_fs, dax: Set up virtio_fs dax_device

2020-08-19 Thread Vivek Goyal
From: Stefan Hajnoczi 

Set up a dax device.

Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86).  Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
Signed-off-by: Sebastien Boeuf 
Signed-off-by: Liu Bo 
---
 fs/fuse/virtio_fs.c| 139 +
 include/uapi/linux/virtio_fs.h |   3 +
 2 files changed, 142 insertions(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 47ecdc15f25d..0fd3b5cecc5f 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,12 +5,16 @@
  */
 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include "fuse_i.h"
 
 /* List of virtio-fs device instances and a lock for the list. Also provides
@@ -49,6 +53,12 @@ struct virtio_fs {
struct virtio_fs_vq *vqs;
unsigned int nvqs;   /* number of virtqueues */
unsigned int num_request_queues; /* number of request queues */
+   struct dax_device *dax_dev;
+
+   /* DAX memory window where file contents are mapped */
+   void *window_kaddr;
+   phys_addr_t window_phys_addr;
+   size_t window_len;
 };
 
 struct virtio_fs_forget_req {
@@ -686,6 +696,131 @@ static void virtio_fs_cleanup_vqs(struct virtio_device 
*vdev,
vdev->config->del_vqs(vdev);
 }
 
+/* Map a window offset to a page frame number.  The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct virtio_fs *fs = dax_get_private(dax_dev);
+   phys_addr_t offset = PFN_PHYS(pgoff);
+   size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+   if (kaddr)
+   *kaddr = fs->window_kaddr + offset;
+   if (pfn)
+   *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+   PFN_DEV | PFN_MAP);
+   return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+  pgoff_t pgoff, void *addr,
+  size_t bytes, struct iov_iter *i)
+{
+   return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+  pgoff_t pgoff, void *addr,
+  size_t bytes, struct iov_iter *i)
+{
+   return copy_to_iter(addr, bytes, i);
+}
+
+static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
+pgoff_t pgoff, size_t nr_pages)
+{
+   long rc;
+   void *kaddr;
+
+   rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+   if (rc < 0)
+   return rc;
+   memset(kaddr, 0, nr_pages << PAGE_SHIFT);
+   dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
+   return 0;
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+   .direct_access = virtio_fs_direct_access,
+   .copy_from_iter = virtio_fs_copy_from_iter,
+   .copy_to_iter = virtio_fs_copy_to_iter,
+   .zero_page_range = virtio_fs_zero_page_range,
+};
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+   struct dax_device *dax_dev = data;
+
+   kill_dax(dax_dev);
+   put_dax(dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
+{
+   struct virtio_shm_region cache_reg;
+   struct dev_pagemap *pgmap;
+   bool have_cache;
+
+   if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+   return 0;
+
+   /* Get cache region */
+   have_cache = virtio_get_shm_region(vdev, &cache_reg,
+  (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+   if (!have_cache) {
+   dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
+   return 0;
+   }
+
+   if (!devm_request_mem_region(&vdev->dev, cache_reg.addr, cache_reg.len,
+dev_name(&vdev->dev))) {
+   dev_warn(&vdev->dev, "could not reserve region addr=0x%llx"
+" len=0x%llx\n", cache_reg.addr, cache_reg.len);
+   return -EBUSY;
+   }
+
+   dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
+  cache_reg.addr);
+
+   pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
+   if (!pgmap)
+   return -ENOMEM;
+
+   pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+   /* Ideally we would

[PATCH v3 09/18] fuse,virtiofs: Add a mount option to enable dax

2020-08-19 Thread Vivek Goyal
Add a mount option to allow using dax with virtio_fs.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h|  7 
 fs/fuse/inode.c |  3 ++
 fs/fuse/virtio_fs.c | 82 +
 3 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cf5e675100ec..04fdd7c41bd1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -486,10 +486,14 @@ struct fuse_fs_context {
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
+   bool dax:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
 
+   /* DAX device, may be NULL */
+   struct dax_device *dax_dev;
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
 };
@@ -761,6 +765,9 @@ struct fuse_conn {
 
/** List of device instances belonging to this connection */
struct list_head devices;
+
+   /** DAX device, non-NULL if DAX is supported */
+   struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2ac5713c4c32..beac337ccc10 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -589,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->dax_dev)
+   seq_printf(m, ",dax");
return 0;
 }
 
@@ -1207,6 +1209,7 @@ int fuse_fill_super_common(struct super_block *sb, struct 
fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+   fc->dax_dev = ctx->dax_dev;
 
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 0fd3b5cecc5f..741cad4abad8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "fuse_i.h"
@@ -81,6 +82,45 @@ struct virtio_fs_req_work {
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
+enum {
+   OPT_DAX,
+};
+
+static const struct fs_parameter_spec virtio_fs_parameters[] = {
+   fsparam_flag("dax", OPT_DAX),
+   {}
+};
+
+static int virtio_fs_parse_param(struct fs_context *fc,
+struct fs_parameter *param)
+{
+   struct fs_parse_result result;
+   struct fuse_fs_context *ctx = fc->fs_private;
+   int opt;
+
+   opt = fs_parse(fc, virtio_fs_parameters, param, &result);
+   if (opt < 0)
+   return opt;
+
+   switch (opt) {
+   case OPT_DAX:
+   ctx->dax = 1;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static void virtio_fs_free_fc(struct fs_context *fc)
+{
+   struct fuse_fs_context *ctx = fc->fs_private;
+
+   if (ctx)
+   kfree(ctx);
+}
+
 static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
 {
struct virtio_fs *fs = vq->vdev->priv;
@@ -1220,23 +1260,27 @@ static const struct fuse_iqueue_ops virtio_fs_fiq_ops = 
{
.release= virtio_fs_fiq_release,
 };
 
-static int virtio_fs_fill_super(struct super_block *sb)
+static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
+{
+   ctx->rootmode = S_IFDIR;
+   ctx->default_permissions = 1;
+   ctx->allow_other = 1;
+   ctx->max_read = UINT_MAX;
+   ctx->blksize = 512;
+   ctx->destroy = true;
+   ctx->no_control = true;
+   ctx->no_force_umount = true;
+}
+
+static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
 {
struct fuse_conn *fc = get_fuse_conn_super(sb);
struct virtio_fs *fs = fc->iq.priv;
+   struct fuse_fs_context *ctx = fsc->fs_private;
unsigned int i;
int err;
-   struct fuse_fs_context ctx = {
-   .rootmode = S_IFDIR,
-   .default_permissions = 1,
-   .allow_other = 1,
-   .max_read = UINT_MAX,
-   .blksize = 512,
-   .destroy = true,
-   .no_control = true,
-   .no_force_umount = true,
-   };
 
+   virtio_fs_ctx_set_defaults(ctx);
+   mutex_lock(&virtio_fs_mutex);
 
/* After holding mutex, make sure virtiofs device is still there.
@@ -1260,8 +1304,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
}
 
/* virti

[PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-19 Thread Vivek Goyal
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field.  Parse this field and
ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are
already 2MB aligned.  Just check the value when the connection is
established.  If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.

The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
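
A worked example of the check (sketch with assumed numbers; it mirrors the
hunk below):

    /* map_alignment = 12  =>  required alignment is 1 << 12 = 4KB.
     * FUSE_DAX_SZ is 2MB, and 2MB % 4KB == 0, so the constraint is met.
     * A map_alignment larger than 21 (i.e. > 2MB) would fail the check. */
    if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
        (FUSE_DAX_SZ % (1ul << arg->map_alignment)))
            ok = false;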

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   | 18 --
 include/uapi/linux/fuse.h |  4 +++-
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 478c940b05b4..4a46e35222c7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,7 +47,10 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size.  A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
 #define FUSE_DAX_SZ (2*1024*1024)
 #define FUSE_DAX_SHIFT (21)
 #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b82eb61d63cc..947abdd776ca 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, 
struct fuse_args *args,
 {
struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
struct fuse_init_out *arg = &ia->out;
+   bool ok = true;
 
if (error || arg->major != FUSE_KERNEL_VERSION)
-   fc->conn_error = 1;
+   ok = false;
else {
unsigned long ra_pages;
 
@@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, 
struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+   if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+   (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
+   pr_err("FUSE: map_alignment %u incompatible"
+  " with dax mem range size %u\n",
+  arg->map_alignment, FUSE_DAX_SZ);
+   ok = false;
+   }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1060,6 +1068,11 @@ static void process_init_reply(struct fuse_conn *fc, 
struct fuse_args *args,
}
kfree(ia);
 
+   if (!ok) {
+   fc->conn_init = 0;
+   fc->conn_error = 1;
+   }
+
fuse_set_initialized(fc);
wake_up_all(&fc->blocked_waitq);
 }
@@ -1082,7 +1095,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+   FUSE_MAP_ALIGNMENT;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..5b85819e045f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -313,7 +313,9 @@ struct fuse_file_lock {
  * FUSE_CACHE_SYMLINKS: cache READLINK responses
  * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
  * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
- * FUSE_MAP_ALIGNMENT: map_alignment field is valid
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
+ *foffset and moffset fields in struct
+ *fuse_setupmapping_out and fuse_removemapping_one.
  */
 #define FUSE_ASYNC_READ (1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 05/18] virtio: Implement get_shm_region for MMIO transport

2020-08-19 Thread Vivek Goyal
From: Sebastien Boeuf 

On MMIO a new set of registers is defined for finding SHM
regions.  Add their definitions and use them to find the region.

Signed-off-by: Sebastien Boeuf 
Cc: k...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: "Michael S. Tsirkin" 
---
 drivers/virtio/virtio_mmio.c | 31 +++
 include/uapi/linux/virtio_mmio.h | 11 +++
 2 files changed, 42 insertions(+)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 627ac0487494..238383ff1064 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -498,6 +498,36 @@ static const char *vm_bus_name(struct virtio_device *vdev)
return vm_dev->pdev->name;
 }
 
+static bool vm_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+   u64 len, addr;
+
+   /* Select the region we're interested in */
+   writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+   /* Read the region size */
+   len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+   len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+   region->len = len;
+
+   /* Check if region length is -1. If that's the case, the shared memory
+* region does not exist and there is no need to proceed further.
+*/
+   if (len == ~(u64)0)
+   return false;
+
+   /* Read the region base address */
+   addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+   addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+   region->addr = addr;
+
+   return true;
+}
+
 static const struct virtio_config_ops virtio_mmio_config_ops = {
.get= vm_get,
.set= vm_set,
@@ -510,6 +540,7 @@ static const struct virtio_config_ops 
virtio_mmio_config_ops = {
.get_features   = vm_get_features,
.finalize_features = vm_finalize_features,
.bus_name   = vm_bus_name,
+   .get_shm_region = vm_get_shm_region,
 };
 
 
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL 0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW 0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH   0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION  0x0fc
 
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 18/18] fuse,virtiofs: Add logic to free up a memory range

2020-08-19 Thread Vivek Goyal
Add logic to free up a busy memory range. A freed memory range will be
returned to the free pool. Add a worker which can be started to select
and free some busy memory ranges.

A process can also steal one of its busy dax ranges if a free range is not
available. I will refer to this as direct reclaim.

If a free range is not available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.

For reclaiming a range, as of now we need to hold the following locks in
the specified order.

down_write(&fi->i_mmap_sem);
down_write(&fi->i_dmap_sem);

We look for a free range in the following order.

A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free
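
A simplified sketch of that order (hypothetical wrapper function; locking,
error handling and retry details omitted):

    static struct fuse_dax_mapping *
    dmap_alloc_or_reclaim(struct fuse_conn *fc, struct inode *inode)
    {
            struct fuse_dax_mapping *dmap;

            /* A. Try to get a free range. */
            dmap = alloc_dax_mapping(fc);
            if (dmap)
                    return dmap;

            /* B. If not, try direct reclaim from this inode. */
            dmap = alloc_dax_mapping_reclaim(fc, inode);
            if (!IS_ERR_OR_NULL(dmap))
                    return dmap;

            /* C. If not, wait for a memory range to become free and retry. */
            wait_event(fc->dax_range_waitq, fc->nr_free_ranges > 0);
            return alloc_dax_mapping(fc);
    }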

Signed-off-by: Vivek Goyal 
Signed-off-by: Liu Bo 
---
 fs/fuse/file.c  | 482 +++-
 fs/fuse/fuse_i.h|  25 +++
 fs/fuse/inode.c |   4 +
 fs/fuse/virtio_fs.c |   5 +
 4 files changed, 508 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 723602813ad6..12c4716fc1e5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -8,6 +8,7 @@
 
 #include "fuse_i.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -35,6 +36,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, 
gfp_t flags,
return pages;
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+   struct inode *inode);
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
 {
@@ -191,6 +194,26 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(>lock);
 }
 
+static void
+__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+   unsigned long free_threshold;
+
+   /* If number of free ranges are below threshold, start reclaim */
+   free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+   (unsigned long)1);
+   if (fc->nr_free_ranges < free_threshold)
+   queue_delayed_work(system_long_wq, &fc->dax_free_work,
+  msecs_to_jiffies(delay_ms));
+}
+
+static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+   spin_lock(&fc->lock);
+   __kick_dmap_free_worker(fc, delay_ms);
+   spin_unlock(&fc->lock);
+}
+
 static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 {
struct fuse_dax_mapping *dmap = NULL;
@@ -199,7 +222,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
 
if (fc->nr_free_ranges <= 0) {
spin_unlock(&fc->lock);
-   return NULL;
+   goto out_kick;
}
 
WARN_ON(list_empty(>free_ranges));
@@ -210,6 +233,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
list_del_init(&dmap->list);
fc->nr_free_ranges--;
spin_unlock(&fc->lock);
+
+out_kick:
+   kick_dmap_free_worker(fc, 0);
return dmap;
 }
 
@@ -236,6 +262,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
 {
list_add_tail(&dmap->list, &fc->free_ranges);
fc->nr_free_ranges++;
+   wake_up(&fc->dax_range_waitq);
 }
 
 static void dmap_add_to_free_pool(struct fuse_conn *fc,
@@ -279,6 +306,12 @@ static int fuse_setup_one_mapping(struct inode *inode, 
unsigned long start_idx,
return err;
dmap->writable = writable;
if (!upgrade) {
+   /*
+* We don't take a reference on inode. inode is valid right now
+* and when inode is going away, cleanup logic should first
+* cleanup dmap entries.
+*/
+   dmap->inode = inode;
dmap->itn.start = dmap->itn.last = start_idx;
/* Protected by fi->i_dmap_sem */
interval_tree_insert(&dmap->itn, &fi->dmap_tree);
@@ -357,6 +390,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn 
*fc,
 "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
 dmap->itn.last, dmap->window_offset, dmap->length);
__dmap_remove_busy_list(fc, dmap);
+   dmap->inode = NULL;
dmap->itn.start = dmap->itn.last = 0;
__dmap_add_to_free_pool(fc, dmap);
 }
@@ -384,6 +418,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, 
struct inode *inode,
if (!node)
break;
dmap = node_to_dmap(node);
+   /* inode is going away. There should not be any users of dmap */
+   WARN_ON(refcount_read(&dmap->refcnt) > 1);
interval_tree_remove(&dmap->itn, &fi->dmap_tree);
num++;
list_add(&dmap->list, &to_remove);
@@ -4

[PATCH v3 02/18] dax: Create a range version of dax_layout_busy_page()

2020-08-19 Thread Vivek Goyal
virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.

As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.

We want to reclaim only a portion of memory (say a 2MB range). So we want
to make sure that only that 2MB range of pages does not have any
references (and we don't want to unmap all the pages of the inode).

Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.
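
A hypothetical caller (sketch, simplified) checking a single 2MB range
before reclaiming it:

    struct page *page;

    /* Any page in [start, start + FUSE_DAX_SZ - 1] that still has an
     * elevated refcount (e.g. from get_user_pages()) blocks reclaim. */
    page = dax_layout_busy_page_range(inode->i_mapping, start,
                                      start + FUSE_DAX_SZ - 1);
    if (page)
            return -EAGAIN; /* busy; caller waits or retries */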

Cc: Dan Williams 
Cc: linux-nvdimm@lists.01.org
Cc: Jan Kara 
Cc: Vishal L Verma 
Cc: "Weiny, Ira" 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c| 29 +++--
 include/linux/dax.h |  6 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 95341af1a966..ddd705251d9f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -559,7 +559,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 }
 
 /**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page_range - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
@@ -572,13 +572,19 @@ static void *grab_mapping_entry(struct xa_state *xas,
  * establishment of new mappings in this address_space. I.e. it expects
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
+ *
+ * Partial pages are included. If 'end' is LLONG_MAX, pages in the range
+ * from 'start' to end of the file are included.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+   loff_t start, loff_t end)
 {
-   XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+   pgoff_t start_idx = start >> PAGE_SHIFT;
+   pgoff_t end_idx;
+   XA_STATE(xas, &mapping->i_pages, start_idx);
 
/*
 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
 
+   /* If end == LLONG_MAX, all pages from start till end of file */
+   if (end == LLONG_MAX)
+   end_idx = ULONG_MAX;
+   else
+   end_idx = end >> PAGE_SHIFT;
/*
 * If we race get_user_pages_fast() here either we'll see the
 * elevated page count in the iteration and wait, or
@@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
 * against is no longer mapped in the page tables and bail to the
 * get_user_pages() slow path.  The slow path is protected by
 * pte_lock() and pmd_lock(). New references are not taken without
-* holding those locks, and unmap_mapping_range() will not zero the
+* holding those locks, and unmap_mapping_pages() will not zero the
 * pte or pmd without holding the respective lock, so we are
 * guaranteed to either see new references or prevent new
 * references from being established.
 */
-   unmap_mapping_range(mapping, 0, 0, 0);
+   unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
 
xas_lock_irq(&xas);
-   xas_for_each(&xas, entry, ULONG_MAX) {
+   xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
@@ -625,6 +636,12 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
xas_unlock_irq(&xas);
return page;
 }
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+   return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+}
 EXPORT_SYMBOL_GPL(dax_layout_busy_page);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..9016929db4c6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct pa

[PATCH v3 04/18] virtio: Implement get_shm_region for PCI transport

2020-08-19 Thread Vivek Goyal
From: Sebastien Boeuf 

On PCI the shm regions are found using capability entries;
find a region by searching for the capability.

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: kbuild test robot 
Acked-by: Michael S. Tsirkin 
Cc: k...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: "Michael S. Tsirkin" 
---
 drivers/virtio/virtio_pci_modern.c | 95 ++
 include/uapi/linux/virtio_pci.h| 11 +++-
 2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 3e14e700b231..3d6ae5a5e252 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -444,6 +444,99 @@ static void del_vq(struct virtio_pci_vq_info *info)
vring_del_virtqueue(vq);
 }
 
+static int virtio_pci_find_shm_cap(struct pci_dev *dev, u8 required_id,
+  u8 *bar, u64 *offset, u64 *len)
+{
+   int pos;
+
+   for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); pos > 0;
+pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+   u8 type, cap_len, id;
+   u32 tmp32;
+   u64 res_offset, res_length;
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+cfg_type), &type);
+   if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+   continue;
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+cap_len), &cap_len);
+   if (cap_len != sizeof(struct virtio_pci_cap64)) {
+   dev_err(&dev->dev, "%s: shm cap with bad size offset:"
+   " %d size: %d\n", __func__, pos, cap_len);
+   continue;
+   }
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+id), &id);
+   if (id != required_id)
+   continue;
+
+   /* Type, and ID match, looks good */
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+bar), bar);
+
+   /* Read the lower 32bit of length and offset */
+   pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
+ offset), &tmp32);
+   res_offset = tmp32;
+   pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
+ length), &tmp32);
+   res_length = tmp32;
+
+   /* and now the top half */
+   pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_cap64,
+offset_hi), &tmp32);
+   res_offset |= ((u64)tmp32) << 32;
+   pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_cap64,
+length_hi), &tmp32);
+   res_length |= ((u64)tmp32) << 32;
+
+   *offset = res_offset;
+   *len = res_length;
+
+   return pos;
+   }
+   return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   struct pci_dev *pci_dev = vp_dev->pci_dev;
+   u8 bar;
+   u64 offset, len;
+   phys_addr_t phys_addr;
+   size_t bar_len;
+
+   if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len))
+   return false;
+
+   phys_addr = pci_resource_start(pci_dev, bar);
+   bar_len = pci_resource_len(pci_dev, bar);
+
+   if ((offset + len) < offset) {
+   dev_err(_dev->dev, "%s: cap offset+len overflow detected\n",
+   __func__);
+   return false;
+   }
+
+   if (offset + len > bar_len) {
+   dev_err(_dev->dev, "%s: bar shorter than cap offset+len\n",
+   __func__);
+   return false;
+   }
+
+   region->len = len;
+   region->addr = (u64) phys_addr + offset;
+
+   return true;
+}
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.get= NULL,
.set= NULL,
@@ -458,6 +551,7 @@ static const struct virtio_config_ops 
virtio_pci_config_nodev_ops = {
.bus_name   = vp_bus_name,
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
+   .get_shm_region  = vp_get_shm_region,
 };
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -474,6 +568,7 @@ 

[PATCH v3 17/18] fuse,virtiofs: Maintain a list of busy elements

2020-08-19 Thread Vivek Goyal
This list will be used for selecting a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c   | 22 ++
 fs/fuse/fuse_i.h |  7 +++
 fs/fuse/inode.c  |  4 
 3 files changed, 33 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index aaa57c625af7..723602813ad6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -213,6 +213,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_del_init(&dmap->busy_list);
+   WARN_ON(fc->nr_busy_ranges == 0);
+   fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+   spin_lock(&fc->lock);
+   __dmap_remove_busy_list(fc, dmap);
+   spin_unlock(&fc->lock);
+}
+
 /* This assumes fc->lock is held */
 static void __dmap_add_to_free_pool(struct fuse_conn *fc,
struct fuse_dax_mapping *dmap)
@@ -266,6 +283,10 @@ static int fuse_setup_one_mapping(struct inode *inode, 
unsigned long start_idx,
/* Protected by fi->i_dmap_sem */
interval_tree_insert(&dmap->itn, &fi->dmap_tree);
fi->nr_dmaps++;
+   spin_lock(&fc->lock);
+   list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+   fc->nr_busy_ranges++;
+   spin_unlock(&fc->lock);
}
return 0;
 }
@@ -335,6 +356,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn 
*fc,
pr_debug("fuse: freeing memory range start_idx=0x%lx end_idx=0x%lx "
 "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
 dmap->itn.last, dmap->window_offset, dmap->length);
+   __dmap_remove_busy_list(fc, dmap);
dmap->itn.start = dmap->itn.last = 0;
__dmap_add_to_free_pool(fc, dmap);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e555c9a33359..400a19a464ca 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -80,6 +80,9 @@ struct fuse_dax_mapping {
/* For interval tree in file/inode */
struct interval_tree_node itn;
 
+   /* Will connect in fc->busy_ranges to keep track busy memory */
+   struct list_head busy_list;
+
/** Position in DAX window */
u64 window_offset;
 
@@ -812,6 +815,10 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
 
+   /* List of memory ranges which are busy */
+   unsigned long nr_busy_ranges;
+   struct list_head busy_ranges;
+
/*
 * DAX Window Free Ranges
 */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3735bc5fdfa2..671e84e3dd99 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -636,6 +636,8 @@ static void fuse_free_dax_mem_ranges(struct list_head 
*mem_list)
/* Free All allocated elements */
list_for_each_entry_safe(range, temp, mem_list, list) {
list_del(&range->list);
+   if (!list_empty(&range->busy_list))
+   list_del(&range->busy_list);
kfree(range);
}
 }
@@ -680,6 +682,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 */
range->window_offset = i * FUSE_DAX_SZ;
range->length = FUSE_DAX_SZ;
+   INIT_LIST_HEAD(&range->busy_list);
list_add_tail(&range->list, &mem_ranges);
}
 
@@ -727,6 +730,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(&fc->free_ranges);
+   INIT_LIST_HEAD(&fc->busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 07/18] fuse: Get rid of no_mount_options

2020-08-19 Thread Vivek Goyal
This option was introduced so that for virtio_fs we don't show any mount
options in fuse_show_options(), because we don't offer any of these options
to be controlled by the mounter.

Very soon we are planning to introduce the option "dax", which the mounter
should be able to specify. And then no_mount_options does not work anymore.
What we need is a per-mount-option flag so that the filesystem can specify
which options to show.

Add a few such flags to control the behavior in a more fine-grained manner
and get rid of no_mount_options.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h| 14 ++
 fs/fuse/inode.c | 22 ++
 fs/fuse/virtio_fs.c |  1 -
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..cf5e675100ec 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -471,18 +471,21 @@ struct fuse_fs_context {
int fd;
unsigned int rootmode;
kuid_t user_id;
+   bool user_id_show;
kgid_t group_id;
+   bool group_id_show;
bool is_bdev:1;
bool fd_present:1;
bool rootmode_present:1;
bool user_id_present:1;
bool group_id_present:1;
bool default_permissions:1;
+   bool default_permissions_show:1;
bool allow_other:1;
+   bool allow_other_show:1;
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
-   bool no_mount_options:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
@@ -512,9 +515,11 @@ struct fuse_conn {
 
/** The user id for this mount */
kuid_t user_id;
+   bool user_id_show:1;
 
/** The group id for this mount */
kgid_t group_id;
+   bool group_id_show:1;
 
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
@@ -698,10 +703,14 @@ struct fuse_conn {
 
/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
+   bool default_permissions_show:1;
 
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;
 
+   /** Show allow_other in mount options */
+   bool allow_other_show:1;
+
/** Does the filesystem support copy_file_range? */
unsigned no_copy_file_range:1;
 
@@ -717,9 +726,6 @@ struct fuse_conn {
/** Do not allow MNT_FORCE umount */
unsigned int no_force_umount:1;
 
-   /* Do not show mount options */
-   unsigned int no_mount_options:1;
-
/** The number of requests waiting for completion */
atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..2ac5713c4c32 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -535,10 +535,12 @@ static int fuse_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
 
case OPT_DEFAULT_PERMISSIONS:
ctx->default_permissions = true;
+   ctx->default_permissions_show = true;
break;
 
case OPT_ALLOW_OTHER:
ctx->allow_other = true;
+   ctx->allow_other_show = true;
break;
 
case OPT_MAX_READ:
@@ -573,14 +575,15 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-   if (fc->no_mount_options)
-   return 0;
-
-   seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, 
fc->user_id));
-   seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, 
fc->group_id));
-   if (fc->default_permissions)
+   if (fc->user_id_show)
+   seq_printf(m, ",user_id=%u",
+  from_kuid_munged(fc->user_ns, fc->user_id));
+   if (fc->group_id_show)
+   seq_printf(m, ",group_id=%u",
+  from_kgid_munged(fc->user_ns, fc->group_id));
+   if (fc->default_permissions && fc->default_permissions_show)
seq_puts(m, ",default_permissions");
-   if (fc->allow_other)
+   if (fc->allow_other && fc->allow_other_show)
seq_puts(m, ",allow_other");
if (fc->max_read != ~0)
seq_printf(m, ",max_read=%u", fc->max_read);
@@ -1193,14 +1196,17 @@ int fuse_fill_super_common(struct super_block *sb, 
struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
 
fc->default_permissions = ctx->default_permissions;
+   fc->default_permissions_show = ctx->default_permissions_show;
fc->allow_other = ctx->allow_other;
+   fc->allow_other_show = ctx->allow_other_show;
fc->user_id = ctx->user_id;
+   fc->user_id_show = ctx->user_id_show

[PATCH v3 03/18] virtio: Add get_shm_region method

2020-08-19 Thread Vivek Goyal
From: Sebastien Boeuf 

Virtio defines 'shared memory regions' that provide a continuously
shared region between the host and guest.

Provide a method to find a particular region on a device.
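
Example use from a device driver (sketch; the region id shown is the cache
id virtio-fs uses elsewhere in this series):

    struct virtio_shm_region cache_reg;

    if (!virtio_get_shm_region(vdev, &cache_reg, VIRTIO_FS_SHMCAP_ID_CACHE))
            return 0;       /* no such region on this device */

    dev_info(&vdev->dev, "shm region: addr=0x%llx len=0x%llx\n",
             cache_reg.addr, cache_reg.len);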

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
Acked-by: Michael S. Tsirkin 
Cc: k...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: "Michael S. Tsirkin" 
---
 include/linux/virtio_config.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 8fe857e27ef3..4b8e38c5c4d8 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -11,6 +11,11 @@
 
 struct irq_affinity;
 
+struct virtio_shm_region {
+   u64 addr;
+   u64 len;
+};
+
 /**
  * virtio_config_ops - operations for configuring a virtio device
  * Note: Do not assume that a transport implements all of the operations
@@ -66,6 +71,7 @@ struct irq_affinity;
  *  the caller can then copy.
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -89,6 +95,8 @@ struct virtio_config_ops {
   const struct cpumask *cpu_mask);
const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
int index);
+   bool (*get_shm_region)(struct virtio_device *vdev,
+  struct virtio_shm_region *region, u8 id);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -251,6 +259,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const 
struct cpumask *cpu_mask)
return 0;
 }
 
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+  struct virtio_shm_region *region, u8 id)
+{
+   if (!vdev->config->get_shm_region)
+   return false;
+   return vdev->config->get_shm_region(vdev, region, id);
+}
+
 static inline bool virtio_is_little_endian(struct virtio_device *vdev)
 {
return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v3 01/18] dax: Modify bdev_dax_pgoff() to handle NULL bdev

2020-08-19 Thread Vivek Goyal
virtiofs does not have a block device but it has dax device.
Modify bdev_dax_pgoff() to be able to handle that.

If there is no bdev, that means dax offset is 0. (It can't be a partition
block device starting at an offset in dax device).

This is a little hackish. There have been discussions about getting rid
of dax not supporting partitions.

https://lore.kernel.org/linux-fsdevel/20200107125159.ga15...@infradead.org/

IMHO, this path can easily break existing users. For example
ioctl(BLKPG_ADD_PARTITION) will start breaking on block devices
supporting DAX. Also, I personally find it very useful to be able to
partition dax devices and still be able to use DAX.

Alternatively, I tried to store offset into dax device information in iomap
interface, but that got NACKed.

https://lore.kernel.org/linux-fsdevel/20200217133117.gb20...@infradead.org/

I can't think of a good path to solve this issue properly. So to make
progress, it seems this patch is least bad option for now and I hope
we can take it.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
Reviewed-by: Jan Kara 
Cc: Christoph Hellwig 
Cc: Dan Williams 
Cc: Jan Kara 
Cc: Vishal L Verma 
Cc: "Weiny, Ira" 
Cc: linux-nvdimm@lists.01.org
---
 drivers/dax/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c82cbcb64202..505165752d8f 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
 {
-   phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+   phys_addr_t phys_off = (start_sect + sector) * 512;
 
if (pgoff)
*pgoff = PHYS_PFN(phys_off);
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()

2020-08-17 Thread Vivek Goyal
On Mon, Aug 17, 2020 at 06:53:39PM +0200, Jan Kara wrote:
> On Fri 07-08-20 15:55:08, Vivek Goyal wrote:
> > virtiofs device has a range of memory which is mapped into file inodes
> > using dax. This memory is mapped in qemu on host and maps different
> > sections of real file on host. Size of this memory is limited
> > (determined by administrator) and depending on filesystem size, we will
> > soon reach a situation where all the memory is in use and we need to
> > reclaim some.
> > 
> > As part of reclaim process, we will need to make sure that there are
> > no active references to pages (taken by get_user_pages()) on the memory
> > range we are trying to reclaim. I am planning to use
> > dax_layout_busy_page() for this. But in current form this is per inode
> > and scans through all the pages of the inode.
> > 
> > We want to reclaim only a portion of memory (say 2MB page). So we want
> > to make sure that only that 2MB range of pages do not have any
> > references  (and don't want to unmap all the pages of inode).
> > 
> > Hence, create a range version of this function named
> > dax_layout_busy_page_range() which can be used to pass a range which
> > needs to be unmapped.
> > 
> > Cc: Dan Williams 
> > Cc: linux-nvdimm@lists.01.org
> > Signed-off-by: Vivek Goyal 
> 
> The API looks OK. Some comments WRT the implementation below.
> 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 11b16729b86f..0d51b0fbb489 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
> > return xa_mk_internal(VM_FAULT_FALLBACK);
> >  }
> >  
> > -/**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > - * @mapping: address space to scan for a page with ref count > 1
> > - *
> > - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > - * 'onlined' to the page allocator so they are considered idle when
> > - * page->count == 1. A filesystem uses this interface to determine if
> > - * any page in the mapping is busy, i.e. for DMA, or other
> > - * get_user_pages() usages.
> > - *
> > - * It is expected that the filesystem is holding locks to block the
> > - * establishment of new mappings in this address_space. I.e. it expects
> > - * to be able to run unmap_mapping_range() and subsequently not race
> > - * mapping_mapped() becoming true.
> > +/*
> > + * Partial pages are included. If end is LLONG_MAX, pages in the range from
> > + * start to end of the file are inluded.
> >   */
> 
> I think the big kerneldoc comment should stay with
> dax_layout_busy_page_range() since dax_layout_busy_page() will be just a
> trivial wrapper around it..

Hi Jan,

Thanks for the review.

Will move kerneldoc comment.


> 
> > -struct page *dax_layout_busy_page(struct address_space *mapping)
> > +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> > +   loff_t start, loff_t end)
> >  {
> > -   XA_STATE(xas, >i_pages, 0);
> > void *entry;
> > unsigned int scanned = 0;
> > struct page *page = NULL;
> > +   pgoff_t start_idx = start >> PAGE_SHIFT;
> > +   pgoff_t end_idx = end >> PAGE_SHIFT;
> > +   XA_STATE(xas, >i_pages, start_idx);
> > +   loff_t len, lstart = round_down(start, PAGE_SIZE);
> >  
> > /*
> >  * In the 'limited' case get_user_pages() for dax is disabled.
> > @@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space 
> > *mapping)
> > if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> > return NULL;
> >  
> > +   /* If end == LLONG_MAX, all pages from start to till end of file */
> > +   if (end == LLONG_MAX) {
> > +   end_idx = ULONG_MAX;
> > +   len = 0;
> > +   } else {
> > +   /* length is being calculated from lstart and not start.
> > +* This is due to behavior of unmap_mapping_range(). If
> > +* start is say 4094 and end is on 4096 then we want to
> > +* unamp two pages, idx 0 and 1. But unmap_mapping_range()
> > +* will unmap only page at idx 0. If we calculate len
> > +* from the rounded down start, this problem should not
> > +* happen.
> > +*/
> > +   len = end - lstart + 1;
> > +   }
> 
> Maybe it would be more understandable to use
>   unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1);
> below and avoid all t

[PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev

2020-08-07 Thread Vivek Goyal
virtiofs does not have a block device but it has dax device.
Modify bdev_dax_pgoff() to be able to handle that.

If there is no bdev, that means dax offset is 0. (It can't be a partition
block device starting at an offset in dax device).

This is a little hackish. There have been discussions about getting rid
of dax not supporting partitions.

https://lore.kernel.org/linux-fsdevel/20200107125159.ga15...@infradead.org/

IMHO, this path can easily break existing users. For example
ioctl(BLKPG_ADD_PARTITION) will start breaking on block devices
supporting DAX. Also, I personally find it very useful to be able to
partition dax devices and still be able to use DAX.

Alternatively, I tried to store offset into dax device information in iomap
interface, but that got NACKed.

https://lore.kernel.org/linux-fsdevel/20200217133117.gb20...@infradead.org/

I can't think of a good path to solve this issue properly. So to make
progress, it seems this patch is least bad option for now and I hope
we can take it.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
Cc: Christoph Hellwig 
Cc: Dan Williams 
Cc: linux-nvdimm@lists.01.org
---
 drivers/dax/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 8e32345be0f7..c4bec437e88b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
 {
-   phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+   phys_addr_t phys_off = (start_sect + sector) * 512;
 
if (pgoff)
*pgoff = PHYS_PFN(phys_off);
-- 
2.25.4
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()

2020-08-07 Thread Vivek Goyal
virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.

As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.

We want to reclaim only a portion of memory (say a 2MB range). So we want
to make sure that only that 2MB range of pages does not have any
references (and we don't want to unmap all the pages of the inode).

Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.

Cc: Dan Williams 
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Vivek Goyal 
---
 fs/dax.c| 66 -
 include/linux/dax.h |  6 +
 2 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..0d51b0fbb489 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
-/**
- * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
+/*
+ * Partial pages are included. If end is LLONG_MAX, pages in the range from
+ * start to end of the file are included.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+   loff_t start, loff_t end)
 {
-   XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+   pgoff_t start_idx = start >> PAGE_SHIFT;
+   pgoff_t end_idx = end >> PAGE_SHIFT;
+   XA_STATE(xas, &mapping->i_pages, start_idx);
+   loff_t len, lstart = round_down(start, PAGE_SIZE);
 
/*
 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
 
+   /* If end == LLONG_MAX, all pages from start till end of file */
+   if (end == LLONG_MAX) {
+   end_idx = ULONG_MAX;
+   len = 0;
+   } else {
+   /* length is being calculated from lstart and not start.
+* This is due to behavior of unmap_mapping_range(). If
+* start is say 4094 and end is on 4096 then we want to
+* unamp two pages, idx 0 and 1. But unmap_mapping_range()
+* will unmap only page at idx 0. If we calculate len
+* from the rounded down start, this problem should not
+* happen.
+*/
+   len = end - lstart + 1;
+   }
+
/*
 * If we race get_user_pages_fast() here either we'll see the
 * elevated page count in the iteration and wait, or
@@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
 * guaranteed to either see new references or prevent new
 * references from being established.
 */
-   unmap_mapping_range(mapping, 0, 0, 0);
+   unmap_mapping_range(mapping, start, len, 0);
 
xas_lock_irq(&xas);
-   xas_for_each(&xas, entry, ULONG_MAX) {
+   xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
@@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
xas_unlock_irq(&xas);
return page;
 }
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idl

Re: [PATCH v3 2/2] x86/copy_mc: Introduce copy_mc_generic()

2020-05-20 Thread Vivek Goyal
On Tue, May 19, 2020 at 03:12:42PM -0700, Dan Williams wrote:
> The original copy_mc_fragile() implementation had negative performance
> implications since it did not use the fast-string instruction sequence
> to perform copies. For this reason copy_mc_to_kernel() fell back to
> plain memcpy() to preserve performance on platform that did not indicate
> the capability to recover from machine check exceptions. However, that
> capability detection was not architectural and now that some platforms
> can recover from fast-string consumption of memory errors the memcpy()
> fallback now causes these more capable platforms to fail.
> 
> Introduce copy_mc_generic() as the fast default implementation of
> copy_mc_to_kernel() and finalize the transition of copy_mc_fragile() to
> be a platform quirk to indicate 'fragility'. With this in place
> copy_mc_to_kernel() is fast and recovery-ready by default regardless of
> hardware capability.
> 
> Thanks to Vivek for identifying that copy_user_generic() is not suitable
> as the copy_mc_to_user() backend since the #MC handler explicitly checks
> ex_has_fault_handler().

/me is curious to know why #MC handler mandates use of _ASM_EXTABLE_FAULT().

[..]
> +/*
> + * copy_mc_generic - memory copy with exception handling
> + *
> + * Fast string copy + fault / exception handling. If the CPU does
> + * support machine check exception recovery, but does not support
> + * recovering from fast-string exceptions then this CPU needs to be
> + * added to the copy_mc_fragile_key set of quirks. Otherwise, absent any
> + * machine check recovery support this version should be no slower than
> + * standard memcpy.
> + */
> +SYM_FUNC_START(copy_mc_generic)
> + ALTERNATIVE "jmp copy_mc_fragile", "", X86_FEATURE_ERMS
> + movq %rdi, %rax
> + movq %rdx, %rcx
> +.L_copy:
> + rep movsb
> + /* Copy successful. Return zero */
> + xorl %eax, %eax
> + ret
> +SYM_FUNC_END(copy_mc_generic)
> +EXPORT_SYMBOL_GPL(copy_mc_generic)
> +
> + .section .fixup, "ax"
> +.E_copy:
> + /*
> +  * On fault %rcx is updated such that the copy instruction could
> +  * optionally be restarted at the fault position, i.e. it
> +  * contains 'bytes remaining'. A non-zero return indicates error
> +  * to copy_safe() users, or indicate short transfers to

copy_safe() is a vestige of terminology from previous patches?

> +  * user-copy routines.
> +  */
> + movq %rcx, %rax
> + ret
> +
> + .previous
> +
> + _ASM_EXTABLE_FAULT(.L_copy, .E_copy)

A question for my own education.

So copy_mc_generic() can handle an MCE on both the source and destination
addresses? (Assuming some device can generate an MCE on stores too.)
On the other hand, copy_mc_fragile() handles MCE recovery only on the
source, and non-MCE (fault) recovery on the destination.

Thanks
Vivek


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-11 Thread Vivek Goyal
On Thu, Apr 30, 2020 at 06:21:45PM -0700, Dan Williams wrote:
> On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds
>  wrote:
> >
> > On Thu, Apr 30, 2020 at 4:52 PM Dan Williams  
> > wrote:
> > >
> > > You had me until here. Up to this point I was grokking that Andy's
> > > "_fallible" suggestion does help explain better than "_safe", because
> > > the copy is doing extra safety checks. copy_to_user() and
> > > copy_to_user_fallible() mean *something* where copy_to_user_safe()
> > > does not.
> >
> > It's a horrible word, btw. The word doesn't actually mean what Andy
> > means it to mean. "fallible" means "can make mistakes", not "can
> > fault".
> >
> > So "fallible" is a horrible name.
> >
> > But anyway, I don't hate something like "copy_to_user_fallible()"
> > conceptually. The naming needs to be fixed, in that "user" can always
> > take a fault, so it's the _source_ that can fault, not the "user"
> > part.
> >
> > It was the "copy_safe()" model that I find unacceptable, that uses
> > _one_ name for what is at the very least *four* different operations:
> >
> >  - copy from faulting memory to user
> >
> >  - copy from faulting memory to kernel
> >
> >  - copy from kernel to faulting memory
> >
> >  - copy within faulting memory
> >
> > No way can you do that with one single function. A kernel address and
> > a user address may literally have the exact same bit representation.
> > So the user vs kernel distinction _has_ to be in the name.
> >
> > The "kernel vs faulting" doesn't necessarily have to be there from an
> > implementation standpoint, but it *should* be there, because
> >
> >  - it might affect implemmentation
> >
> >  - but even if it DOESN'T affect implementation, it should be separate
> > just from the standpoint of being self-documenting code.
> >
> > > However you lose me on this "broken nvdimm semantics" contention.
> > > There is nothing nvdimm-hardware specific about the copy_safe()
> > > implementation, zero, nada, nothing new to the error model that DRAM
> > > did not also inflict on the Linux implementation.
> >
> > Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().
> >
> > Just make sure that the nvdimm code doesn't use invalid kernel
> > addresses or other broken poisoning.
> >
> > Problem solved.
> >
> > You can't have it both ways. Either memcpy just works, or it doesn't.
> 
> It doesn't, but copy_to_user() is frustratingly close and you can see
> in the patch that I went ahead and used copy_user_generic() to
> implement the backend of the default "fast" implementation.
> 
> However now I see that copy_user_generic() works for the wrong reason.
> It works because the exception on the source address due to poison
> looks no different than a write fault on the user address to the
> caller, it's still just a short copy. So it makes copy_to_user() work
> for the wrong reason relative to the name.
> 
> How about, following your suggestion, introduce copy_mc_to_user() (can
> just use copy_user_generic() internally) and copy_mc_to_kernel() for
> the other the helpers that the copy_to_iter() implementation needs?
> That makes it clear that no mmu-faults are expected on reads, only
> exceptions, and no protection-faults are expected at all for
> copy_mc_to_kernel() even if it happens to accidentally handle it.
> Following Jann's ex_handler_uaccess() example I could arrange for
> copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that
> the only type of exception meant to be handled is MC and warn
> otherwise?

While we are discussing this, I wanted to mention another use case
I am looking at: using DAX for virtiofs. The virtiofs device
exports a shared memory region which the guest maps using DAX. The
virtiofs driver's dax ops ->copy_to_iter() and ->copy_from_iter() now
need to copy contents between this shared memory region and user space.

So far we have focused only on nvdimm and expect only a machine
check on the read side, IIUC. But this virtual device will probably
need something more.

- A page can go missing on host (because file got truncated). So error
  can happen both in read and write path.

- It might not be a machine check to report this kind of error. KVM as
  of now considering using an interrupt to report errors or possibly
  using #VE to report errors.

IOW, tying these new helpers only to machine check will work well for
the nvdimm use case but not for virtual devices like virtiofs, and we will
end up defining more helpers. Something more generic than machine check
might be able to address both.
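
As a concrete (purely hypothetical) illustration of the shape I mean --
copy_fallible() is an invented name here, not a proposed API:

/*
 * Sketch only: a copy helper that returns "bytes not copied" no matter
 * whether the fault / exception / missing-page error came from the
 * source or the destination.  That shape would cover both nvdimm
 * (#MC on reads) and a virtiofs-style device (errors possible on both
 * reads and writes).
 */
size_t copy_fallible(void *dst, const void *src, size_t len);

static size_t virtiofs_dax_copy(void *dax_addr, void *buf, size_t bytes)
{
	size_t left = copy_fallible(dax_addr, buf, bytes);

	/* A short count means one of the two sides reported an error. */
	return bytes - left;
}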

Thanks
Vivek


Re: [PATCH 20/20] fuse,virtiofs: Add logic to free up a memory range

2020-04-16 Thread Vivek Goyal
On Thu, Apr 16, 2020 at 01:22:29AM +0800, Liu Bo wrote:
> On Tue, Apr 14, 2020 at 03:30:45PM -0400, Vivek Goyal wrote:
> > On Sat, Mar 28, 2020 at 06:06:06AM +0800, Liu Bo wrote:
> > > On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > > On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
> > > > 
> > > > [..]
> > > > > > +/*
> > > > > > + * Find first mapping in the tree and free it and return it. Do 
> > > > > > not add
> > > > > > + * it back to free pool. If fault == true, this function should be 
> > > > > > called
> > > > > > + * with fi->i_mmap_sem held.
> > > > > > + */
> > > > > > +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct 
> > > > > > fuse_conn *fc,
> > > > > > +struct inode 
> > > > > > *inode,
> > > > > > +bool fault)
> > > > > > +{
> > > > > > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > > > > > +   struct fuse_dax_mapping *dmap;
> > > > > > +   int ret;
> > > > > > +
> > > > > > +   if (!fault)
> > > > > > +   down_write(>i_mmap_sem);
> > > > > > +
> > > > > > +   /*
> > > > > > +* Make sure there are no references to inode pages using
> > > > > > +* get_user_pages()
> > > > > > +*/
> > > > > > +   ret = fuse_break_dax_layouts(inode, 0, 0);
> > > > > 
> > > > > Hi Vivek,
> > > > > 
> > > > > This patch is enabling inline reclaim for fault path, but fault path
> > > > > has already holds a locked exceptional entry which I believe the above
> > > > > fuse_break_dax_layouts() needs to wait for, can you please elaborate
> > > > > on how this can be avoided?
> > > > > 
> > > > 
> > > > Hi Liubo,
> > > > 
> > > > Can you please point to the exact lock you are referring to. I will
> > > > check it out. Once we got rid of needing to take inode lock in
> > > > reclaim path, that opended the door to do inline reclaim in fault
> > > > path as well. But I was not aware of this exceptional entry lock.
> > > 
> > > Hi Vivek,
> > > 
> > > dax_iomap_{pte,pmd}_fault has called grab_mapping_entry to get a
> > > locked entry, when this fault gets into inline reclaim, would
> > > fuse_break_dax_layouts wait for the locked exceptional entry which is
> > > locked in dax_iomap_{pte,pmd}_fault?
> > 
> > Hi Liu Bo,
> > 
> > This is a good point. Indeed it can deadlock the way code is written
> > currently.
> >
> 
> It's 100% reproducible on 4.19, but not on 5.x which has xarray for
> dax_layout_busy_page.
> 
> It was weird that on 5.x kernel the deadlock is gone, it turned out
> that xarray search in dax_layout_busy_page simply skips the empty
> locked exceptional entry, I didn't get deeper to find out whether it's
> reasonable, but with that 5.x doesn't run to deadlock.

I found more problems with enabling inline reclaim in the fault path. I
am holding fi->i_mmap_sem shared, and fuse_break_dax_layouts() can
drop fi->i_mmap_sem if a page is busy. I don't think we can drop and
reacquire fi->i_mmap_sem while in the fault path.

Also, fuse_break_dax_layouts() does not know whether we are holding it
shared or exclusive.

So I will probably have to go back to disabling inline reclaim in the
fault path. If a memory range is not available, go back up in
fuse_dax_fault(), drop the fi->i_mmap_sem lock, wait on a wait queue for
a range to become free, and retry.

I can retain the changes I did to break the layout for a 2MB range only
and not the whole file. I think that's a good optimization to keep
anyway.
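
To make the retry flow concrete, a rough sketch (illustrative only;
__fuse_dax_fault() and dax_range_waitq are assumed names from my
development tree, not posted code):

	bool retry;
	vm_fault_t ret;

	do {
		retry = false;
		down_read(&fi->i_mmap_sem);
		/* Fails with retry=true if no free dax range was available */
		ret = __fuse_dax_fault(vmf, pe_size, write, &retry);
		up_read(&fi->i_mmap_sem);

		if (retry) {
			/* Wait for the reclaim worker to free a range */
			wait_event(fc->dax_range_waitq,
				   fc->nr_free_ranges > 0);
		}
	} while (retry);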

Vivek


Re: [PATCH 20/20] fuse,virtiofs: Add logic to free up a memory range

2020-04-14 Thread Vivek Goyal
On Sat, Mar 28, 2020 at 06:06:06AM +0800, Liu Bo wrote:
> On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> > On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
> > 
> > [..]
> > > > +/*
> > > > + * Find first mapping in the tree and free it and return it. Do not add
> > > > + * it back to free pool. If fault == true, this function should be 
> > > > called
> > > > + * with fi->i_mmap_sem held.
> > > > + */
> > > > +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct 
> > > > fuse_conn *fc,
> > > > +struct inode 
> > > > *inode,
> > > > +bool fault)
> > > > +{
> > > > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > > > +   struct fuse_dax_mapping *dmap;
> > > > +   int ret;
> > > > +
> > > > +   if (!fault)
> > > > +   down_write(>i_mmap_sem);
> > > > +
> > > > +   /*
> > > > +* Make sure there are no references to inode pages using
> > > > +* get_user_pages()
> > > > +*/
> > > > +   ret = fuse_break_dax_layouts(inode, 0, 0);
> > > 
> > > Hi Vivek,
> > > 
> > > This patch is enabling inline reclaim for fault path, but fault path
> > > has already holds a locked exceptional entry which I believe the above
> > > fuse_break_dax_layouts() needs to wait for, can you please elaborate
> > > on how this can be avoided?
> > > 
> > 
> > Hi Liubo,
> > 
> > Can you please point to the exact lock you are referring to. I will
> > check it out. Once we got rid of needing to take inode lock in
> > reclaim path, that opended the door to do inline reclaim in fault
> > path as well. But I was not aware of this exceptional entry lock.
> 
> Hi Vivek,
> 
> dax_iomap_{pte,pmd}_fault has called grab_mapping_entry to get a
> locked entry, when this fault gets into inline reclaim, would
> fuse_break_dax_layouts wait for the locked exceptional entry which is
> locked in dax_iomap_{pte,pmd}_fault?

Hi Liu Bo,

This is a good point. Indeed it can deadlock the way code is written
currently.

Currently we are calling fuse_break_dax_layouts() on the whole file
in the inline memory reclaim path. I am thinking of changing that. Instead,
find a mapped memory range and its file offset and call
fuse_break_dax_layouts() only on that range (2MB). This should ensure
that we don't try to break the dax layout in the range where we are holding
the exceptional entry lock, and avoids the deadlock possibility.

This also has the added benefit that we don't have to unmap the whole
file in an attempt to reclaim one memory range. We will unmap only
a portion of the file, which should be good from a performance point of
view.

Here is a proof-of-concept patch which applies on top of my internal
tree.

---
 fs/fuse/file.c |   72 +++--
 1 file changed, 50 insertions(+), 22 deletions(-)

Index: redhat-linux/fs/fuse/file.c
===
--- redhat-linux.orig/fs/fuse/file.c2020-04-14 13:47:19.493780528 -0400
+++ redhat-linux/fs/fuse/file.c 2020-04-14 14:58:26.814079643 -0400
@@ -4297,13 +4297,13 @@ static int fuse_break_dax_layouts(struct
 return ret;
 }
 
-/* Find first mapping in the tree and free it. */
-static struct fuse_dax_mapping *
-inode_reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode)
+/* Find first mapped dmap for an inode and return file offset. Caller needs
+ * to hold inode->i_dmap_sem lock either shared or exclusive. */
+static struct fuse_dax_mapping *inode_lookup_first_dmap(struct fuse_conn *fc,
+   struct inode *inode)
 {
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_dax_mapping *dmap;
-   int ret;
 
for (dmap = fuse_dax_interval_tree_iter_first(>dmap_tree, 0, -1);
 dmap;
@@ -4312,18 +4312,6 @@ inode_reclaim_one_dmap_locked(struct fus
if (refcount_read(>refcnt) > 1)
continue;
 
-   ret = reclaim_one_dmap_locked(fc, inode, dmap);
-   if (ret < 0)
-   return ERR_PTR(ret);
-
-   /* Clean up dmap. Do not add back to free list */
-   dmap_remove_busy_list(fc, dmap);
-   dmap->inode = NULL;
-   dmap->start = dmap->end = 0;
-
-   pr_debug("fuse: %s: reclaimed memory range. inode=%px,"
-"

Re: [PATCH 13/20] fuse, dax: Implement dax read/write operations

2020-04-14 Thread Vivek Goyal
On Sat, Apr 04, 2020 at 08:25:21AM +0800, Liu Bo wrote:

[..]
> > +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> > +loff_t length, unsigned flags,
> > +struct iomap *iomap)
> > +{
> > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > +   struct fuse_dax_mapping *dmap;
> > +   int ret;
> > +
> > +   /*
> > +* Take exclusive lock so that only one caller can try to setup
> > +* mapping and others wait.
> > +*/
> > +   down_write(>i_dmap_sem);
> > +   dmap = fuse_dax_interval_tree_iter_first(>dmap_tree, pos, pos);
> > +
> > +   /* We are holding either inode lock or i_mmap_sem, and that should
> > +* ensure that dmap can't reclaimed or truncated and it should still
> > +* be there in tree despite the fact we dropped and re-acquired the
> > +* lock.
> > +*/
> > +   ret = -EIO;
> > +   if (WARN_ON(!dmap))
> > +   goto out_err;
> > +
> > +   /* Maybe another thread already upgraded mapping while we were not
> > +* holding lock.
> > +*/
> > +   if (dmap->writable)
> 
> oops, looks like it's still returning -EIO here, %ret should be zero.
> 

Good catch. Will fix it.

Vivek


Re: [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range

2020-04-01 Thread Vivek Goyal
On Tue, Mar 31, 2020 at 12:38:16PM -0700, Dan Williams wrote:
> On Tue, Feb 18, 2020 at 1:49 PM Vivek Goyal  wrote:
> >
> > Add a dax operation zero_page_range, to zero a range of memory. This will
> > also clear any poison in the range being zeroed.
> >
> > As of now, zeroing of up to one page is allowed in a single call. There
> > are no callers which are trying to zero more than a page in a single call.
> > Once we grow the callers which zero more than a page in single call, we
> > can add that support. Primary reason for not doing that yet is that this
> > will add little complexity in dm implementation where a range might be
> > spanning multiple underlying targets and one will have to split the range
> > into multiple sub ranges and call zero_page_range() on individual targets.
> >
> > Suggested-by: Christoph Hellwig 
> > Reviewed-by: Christoph Hellwig 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  drivers/dax/super.c   | 19 +++
> >  drivers/nvdimm/pmem.c | 10 ++
> >  include/linux/dax.h   |  3 +++
> >  3 files changed, 32 insertions(+)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 0aa4b6bc5101..c912808bc886 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -344,6 +344,25 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, 
> > pgoff_t pgoff, void *addr,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
> >
> > +int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len)
> > +{
> > +   if (!dax_alive(dax_dev))
> > +   return -ENXIO;
> > +
> > +   if (!dax_dev->ops->zero_page_range)
> > +   return -EOPNOTSUPP;
> 
> This seems too late to be doing the validation. It would be odd for
> random filesystem operations to see this error. I would move the check
> to alloc_dax() and fail that if the caller fails to implement the
> operation.
> 
> An incremental patch on top to fix this up would be ok. Something like
> "Now that all dax_operations providers implement zero_page_range()
> mandate it at alloc_dax time".

Hi Dan,

Posted an extra patch in same patch series for this.

https://lore.kernel.org/linux-fsdevel/20200228163456.1587-1-vgo...@redhat.com/T/#m624680cbb5e714266d4b34ade2d6c390dae69598

Vivek
> 


[PATCH v6 7/6] dax: Move mandatory ->zero_page_range() check in alloc_dax()

2020-04-01 Thread Vivek Goyal
The zero_page_range() dax operation is mandatory for dax devices. Right now
that check happens in the dax_zero_page_range() function. Dan thinks that's
too late and it's better to do the check earlier, in alloc_dax().

I also modified alloc_dax() to return a pointer with an error code encoded
in it in case of failure. Right now it returns NULL and the caller assumes
the failure happened due to -ENOMEM. But with this ->zero_page_range()
check, I need to return -EINVAL instead.

Signed-off-by: Vivek Goyal 
---
 drivers/dax/bus.c|4 +++-
 drivers/dax/super.c  |   14 +-
 drivers/md/dm.c  |2 +-
 drivers/nvdimm/pmem.c|4 ++--
 drivers/s390/block/dcssblk.c |5 +++--
 5 files changed, 18 insertions(+), 11 deletions(-)

Index: redhat-linux/drivers/dax/super.c
===
--- redhat-linux.orig/drivers/dax/super.c   2020-04-01 12:03:39.911439769 
-0400
+++ redhat-linux/drivers/dax/super.c2020-04-01 12:05:31.727439769 -0400
@@ -349,9 +349,6 @@ int dax_zero_page_range(struct dax_devic
 {
if (!dax_alive(dax_dev))
return -ENXIO;
-
-   if (!dax_dev->ops->zero_page_range)
-   return -EOPNOTSUPP;
/*
 * There are no callers that want to zero more than one page as of now.
 * Once users are there, this check can be removed after the
@@ -571,9 +568,16 @@ struct dax_device *alloc_dax(void *priva
dev_t devt;
int minor;
 
+   if (ops && !ops->zero_page_range) {
+   pr_debug("%s: error: device does not provide dax"
+" operation zero_page_range()\n",
+__host ? __host : "Unknown");
+   return ERR_PTR(-EINVAL);
+   }
+
host = kstrdup(__host, GFP_KERNEL);
if (__host && !host)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
minor = ida_simple_get(_minor_ida, 0, MINORMASK+1, GFP_KERNEL);
if (minor < 0)
@@ -596,7 +600,7 @@ struct dax_device *alloc_dax(void *priva
ida_simple_remove(_minor_ida, minor);
  err_minor:
kfree(host);
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
Index: redhat-linux/drivers/nvdimm/pmem.c
===
--- redhat-linux.orig/drivers/nvdimm/pmem.c 2020-04-01 12:03:39.911439769 
-0400
+++ redhat-linux/drivers/nvdimm/pmem.c  2020-04-01 12:05:31.729439769 -0400
@@ -487,9 +487,9 @@ static int pmem_attach_disk(struct devic
if (is_nvdimm_sync(nd_region))
flags = DAXDEV_F_SYNC;
dax_dev = alloc_dax(pmem, disk->disk_name, _dax_ops, flags);
-   if (!dax_dev) {
+   if (IS_ERR(dax_dev)) {
put_disk(disk);
-   return -ENOMEM;
+   return PTR_ERR(dax_dev);
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
Index: redhat-linux/drivers/dax/bus.c
===
--- redhat-linux.orig/drivers/dax/bus.c 2020-04-01 12:03:39.911439769 -0400
+++ redhat-linux/drivers/dax/bus.c  2020-04-01 12:05:31.729439769 -0400
@@ -421,8 +421,10 @@ struct dev_dax *__devm_create_dev_dax(st
 * device outside of mmap of the resulting character device.
 */
dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
-   if (!dax_dev)
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto err;
+   }
 
/* a device_dax instance is dead while the driver is not attached */
kill_dax(dax_dev);
Index: redhat-linux/drivers/s390/block/dcssblk.c
===
--- redhat-linux.orig/drivers/s390/block/dcssblk.c  2020-04-01 
12:03:39.911439769 -0400
+++ redhat-linux/drivers/s390/block/dcssblk.c   2020-04-01 12:05:31.730439769 
-0400
@@ -695,8 +695,9 @@ dcssblk_add_store(struct device *dev, st
 
dev_info->dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
_dax_ops, DAXDEV_F_SYNC);
-   if (!dev_info->dax_dev) {
-   rc = -ENOMEM;
+   if (IS_ERR(dev_info->dax_dev)) {
+   rc = PTR_ERR(dev_info->dax_dev);
+   dev_info->dax_dev = NULL;
goto put_dev;
}
 
Index: redhat-linux/drivers/md/dm.c
===
--- redhat-linux.orig/drivers/md/dm.c   2020-04-01 12:03:39.911439769 -0400
+++ redhat-linux/drivers/md/dm.c2020-04-01 12:05:31.732439769 -0400
@@ -2005,7 +2005,7 @@ static struct mapped_device *alloc_dev(i
if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
md->dax_dev = alloc_dax(md, md->disk->disk_name,
_dax_ops, 0);
- 

Re: [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range

2020-04-01 Thread Vivek Goyal
On Tue, Mar 31, 2020 at 12:38:16PM -0700, Dan Williams wrote:
> On Tue, Feb 18, 2020 at 1:49 PM Vivek Goyal  wrote:
> >
> > Add a dax operation zero_page_range, to zero a range of memory. This will
> > also clear any poison in the range being zeroed.
> >
> > As of now, zeroing of up to one page is allowed in a single call. There
> > are no callers which are trying to zero more than a page in a single call.
> > Once we grow the callers which zero more than a page in single call, we
> > can add that support. Primary reason for not doing that yet is that this
> > will add little complexity in dm implementation where a range might be
> > spanning multiple underlying targets and one will have to split the range
> > into multiple sub ranges and call zero_page_range() on individual targets.
> >
> > Suggested-by: Christoph Hellwig 
> > Reviewed-by: Christoph Hellwig 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  drivers/dax/super.c   | 19 +++
> >  drivers/nvdimm/pmem.c | 10 ++
> >  include/linux/dax.h   |  3 +++
> >  3 files changed, 32 insertions(+)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 0aa4b6bc5101..c912808bc886 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -344,6 +344,25 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, 
> > pgoff_t pgoff, void *addr,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
> >
> > +int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len)
> > +{
> > +   if (!dax_alive(dax_dev))
> > +   return -ENXIO;
> > +
> > +   if (!dax_dev->ops->zero_page_range)
> > +   return -EOPNOTSUPP;
> 
> This seems too late to be doing the validation. It would be odd for
> random filesystem operations to see this error. I would move the check
> to alloc_dax() and fail that if the caller fails to implement the
> operation.
> 
> An incremental patch on top to fix this up would be ok. Something like
> "Now that all dax_operations providers implement zero_page_range()
> mandate it at alloc_dax time".

Hi Dan,

Ok, I will send an incremental patch for this.

BTW, I have posted V6 of this patch series and you might want to look
at that instead of V5.

https://lore.kernel.org/linux-fsdevel/20200228163456.1587-1-vgo...@redhat.com/

Vivek


Re: [PATCH 20/20] fuse,virtiofs: Add logic to free up a memory range

2020-03-27 Thread Vivek Goyal
On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:

[..]
> > +/*
> > + * Find first mapping in the tree and free it and return it. Do not add
> > + * it back to free pool. If fault == true, this function should be called
> > + * with fi->i_mmap_sem held.
> > + */
> > +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn 
> > *fc,
> > +struct inode *inode,
> > +bool fault)
> > +{
> > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > +   struct fuse_dax_mapping *dmap;
> > +   int ret;
> > +
> > +   if (!fault)
> > +   down_write(>i_mmap_sem);
> > +
> > +   /*
> > +* Make sure there are no references to inode pages using
> > +* get_user_pages()
> > +*/
> > +   ret = fuse_break_dax_layouts(inode, 0, 0);
> 
> Hi Vivek,
> 
> This patch is enabling inline reclaim for fault path, but fault path
> has already holds a locked exceptional entry which I believe the above
> fuse_break_dax_layouts() needs to wait for, can you please elaborate
> on how this can be avoided?
> 

Hi Liubo,

Can you please point to the exact lock you are referring to? I will
check it out. Once we got rid of needing to take the inode lock in the
reclaim path, that opened the door to do inline reclaim in the fault
path as well. But I was not aware of this exceptional entry lock.

Vivek


Re: [PATCH 00/20] virtiofs: Add DAX support

2020-03-16 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 02:38:03PM +0100, Patrick Ohly wrote:
> Vivek Goyal  writes:
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> 
> As a potential user of this, let me make sure I understand the expected
> outcome: is the goal to let virtiofs use DAX (for increased performance,
> etc.) or also let applications that use virtiofs use DAX?
> 
> You are mentioning using the host's page cache, so it's probably the
> former and MAP_SYNC on virtiofs will continue to be rejected, right?

Hi Patrick,

You are right, it's the former. That is, we want virtiofs to be able to
make use of DAX to bypass the guest page cache. But there is no persistent
memory, so no persistent memory programming semantics are available to user
space. For that I guess we have virtio-pmem.

We expect users will issue fsync/msync like on a regular filesystem to
make changes persistent. So in that respect, rejecting MAP_SYNC
makes sense. I will test and see whether the current code rejects MAP_SYNC
or not.
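
For reference, a minimal sketch of how that rejection is typically done in
a dax-aware ->mmap (illustrative only; whether fuse already has an
equivalent check, and the fuse_dax_vm_ops name, are assumptions I still
need to verify):

static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct fuse_conn *fc = get_fuse_conn(file_inode(file));

	/* MAP_SYNC only makes sense for real persistent memory; reject it
	 * unless the dax device advertises synchronous fault support
	 * (virtiofs does not). */
	if (!daxdev_mapping_supported(vma, fc->dax_dev))
		return -EOPNOTSUPP;

	vma->vm_ops = &fuse_dax_vm_ops;
	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
	return 0;
}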

Thanks
Vivek


Re: [PATCH 13/20] fuse, dax: Implement dax read/write operations

2020-03-13 Thread Vivek Goyal
On Fri, Mar 13, 2020 at 11:18:15AM +0100, Miklos Szeredi wrote:

[..]
> > > > +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> > > > +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> > > > + struct fuse_dax_mapping *dmap, bool 
> > > > writable,
> > > > + bool upgrade)
> > > > +{
> > > > +   struct fuse_conn *fc = get_fuse_conn(inode);
> > > > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > > > +   struct fuse_setupmapping_in inarg;
> > > > +   FUSE_ARGS(args);
> > > > +   ssize_t err;
> > > > +
> > > > +   WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> > > > +   WARN_ON(fc->nr_free_ranges < 0);
> > > > +
> > > > +   /* Ask fuse daemon to setup mapping */
> > > > +   memset(, 0, sizeof(inarg));
> > > > +   inarg.foffset = offset;
> > > > +   inarg.fh = -1;
> > > > +   inarg.moffset = dmap->window_offset;
> > > > +   inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> > > > +   inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> > > > +   if (writable)
> > > > +   inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> > > > +   args.opcode = FUSE_SETUPMAPPING;
> > > > +   args.nodeid = fi->nodeid;
> > > > +   args.in_numargs = 1;
> > > > +   args.in_args[0].size = sizeof(inarg);
> > > > +   args.in_args[0].value = 
> > >
> > > args.force = true?
> >
> > I can do that but I am not sure what exactly does args.force do and
> > why do we need it in this case.
> 
> Hm, it prevents interrupts.  Looking closely, however it will only
> prevent SIGKILL from immediately interrupting the request, otherwise
> it will send an INTERRUPT request and the filesystem can ignore that.
> Might make sense to have a args.nonint flag to prevent the sending of
> INTERRUPT...

Hi Miklos,

virtiofs does not support interrupt requests yet. Its fiq interrupt
handler just does not do anything.

static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->lock)
{
/*
 * TODO interrupts.
 *
 * Normal fs operations on a local filesystems aren't interruptible.
 * Exceptions are blocking lock operations; for example fcntl(F_SETLKW)
 * with shared lock between host and guest.
 */
spin_unlock(>lock);
}

So as of now, setting force or not will not make any difference. We will
still end up waiting for the request to finish.

In fact, I think there is no mechanism to set fc->no_interrupt in
virtio_fs. If I am reading request_wait_answer() correctly, it will
see that fc->no_interrupt is not set. That means the filesystem supports
interrupt requests, so it will do wait_event_interruptible() and
not even check the FR_FORCE bit.

Right now fc->no_interrupt is set in response to an INTERRUPT request
reply. Would it make sense to also be able to set it as part of the
connection negotiation protocol, so the filesystem can say up front
that it does not support interrupts and virtiofs can make use of that?
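
Something along these lines is what I have in mind (purely a sketch;
FUSE_NO_INTERRUPT is an invented flag name, it does not exist in fuse.h
today):

/* Hypothetical INIT flag: server says it will never honor INTERRUPT */
#define FUSE_NO_INTERRUPT	(1 << 25)

/* In process_init_reply(), next to the other flag checks: */
	if (arg->flags & FUSE_NO_INTERRUPT)
		fc->no_interrupt = 1;

With that set at init time, request_wait_answer() would go straight to the
wait_event_killable()/FR_FORCE path instead of wait_event_interruptible().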

So the force flag is only useful if the filesystem does not support
interrupts; in that case we do wait_event_killable() and, upon receiving
SIGKILL, cancel the request if it is still in the pending queue. For virtiofs,
we take the request out of the fiq->pending queue in the submission path itself,
and if it can't be dispatched it waits on a virtiofs-specific queue
with FR_PENDING cleared. That means setting FR_FORCE for virtiofs
does not mean anything, as the caller will end up waiting for the
request to finish anyway.

IOW, setting FR_FORCE will make sense when we have a mechanism to
detect that a request is still queued in the virtiofs queues and a
mechanism to cancel it. We don't have that. In fact, given we are
a push model, we dispatch requests immediately to the filesystem
unless the virtqueue is full. So the probability of a request
still sitting in a virtiofs queue is low.

So maybe we can start setting force at some point later,
when we have a mechanism to detect and cancel pending requests
in virtiofs.

> 
> > First thing it does is that request is allocated with flag __GFP_NOFAIL.
> > Second thing it does is that caller is forced to wait for request
> > completion and its not an interruptible sleep.
> >
> > I am wondering what makes FUSE_SETUPMAPING/FUSE_REMOVEMAPPING requests
> > special that we need to set force flag.
> 
> Maybe not for SETUPMAPPING (I was confused by the error log).
> 
> However if REMOVEMAPPING fails for some reason, than that dax mapping
> will be leaked for the lifetime of the filesystem.   Or am I
> misunderstanding it?

FUSE_REMOVEMAPPING is not a must. If we send another FUSE_SETUPMAPPING, then
it will create the new mapping and free up resources associated with
the previous mapping, IIUC.

So at one point we were wondering what the point of
sending FUSE_REMOVEMAPPING is. It helps a bit with freeing up filesystem
resources earlier. So if the cache size is big, then there will not be
much reclaim 

Re: [PATCH 13/20] fuse, dax: Implement dax read/write operations

2020-03-12 Thread Vivek Goyal
On Thu, Mar 12, 2020 at 10:43:10AM +0100, Miklos Szeredi wrote:
> On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal  wrote:
> >
> > This patch implements basic DAX support. mmap() is not implemented
> > yet and will come in later patches. This patch looks into implemeting
> > read/write.
> >
> > We make use of interval tree to keep track of per inode dax mappings.
> >
> > Do not use dax for file extending writes, instead just send WRITE message
> > to daemon (like we do for direct I/O path). This will keep write and
> > i_size change atomic w.r.t crash.
> >
> > Signed-off-by: Stefan Hajnoczi 
> > Signed-off-by: Dr. David Alan Gilbert 
> > Signed-off-by: Vivek Goyal 
> > Signed-off-by: Miklos Szeredi 
> > Signed-off-by: Liu Bo 
> > Signed-off-by: Peng Tao 
> > ---
> >  fs/fuse/file.c| 597 +-
> >  fs/fuse/fuse_i.h  |  23 ++
> >  fs/fuse/inode.c   |   6 +
> >  include/uapi/linux/fuse.h |   1 +
> >  4 files changed, 621 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9d67b830fb7a..9effdd3dc6d6 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -18,6 +18,12 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
> > + START, LAST, static inline, fuse_dax_interval_tree);
> 
> Are you using this because of byte ranges (u64)?   Does it not make
> more sense to use page offsets, which are unsigned long and so fit
> nicely into the generic interval tree?

I think I should be able to use generic interval tree. I will switch
to that.

[..]
> > +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> > +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> > + struct fuse_dax_mapping *dmap, bool 
> > writable,
> > + bool upgrade)
> > +{
> > +   struct fuse_conn *fc = get_fuse_conn(inode);
> > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > +   struct fuse_setupmapping_in inarg;
> > +   FUSE_ARGS(args);
> > +   ssize_t err;
> > +
> > +   WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> > +   WARN_ON(fc->nr_free_ranges < 0);
> > +
> > +   /* Ask fuse daemon to setup mapping */
> > +   memset(, 0, sizeof(inarg));
> > +   inarg.foffset = offset;
> > +   inarg.fh = -1;
> > +   inarg.moffset = dmap->window_offset;
> > +   inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> > +   inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> > +   if (writable)
> > +   inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> > +   args.opcode = FUSE_SETUPMAPPING;
> > +   args.nodeid = fi->nodeid;
> > +   args.in_numargs = 1;
> > +   args.in_args[0].size = sizeof(inarg);
> > +   args.in_args[0].value = 
> 
> args.force = true?

I can do that, but I am not sure what exactly args.force does and
why we need it in this case.

The first thing it does is that the request is allocated with the flag
__GFP_NOFAIL. The second thing is that the caller is forced to wait for
request completion, and it's not an interruptible sleep.

I am wondering what makes FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING requests
special that we need to set the force flag.

> 
> > +   err = fuse_simple_request(fc, );
> > +   if (err < 0) {
> > +   printk(KERN_ERR "%s request failed at mem_offset=0x%llx 
> > %zd\n",
> > +__func__, dmap->window_offset, err);
> 
> Is this level of noisiness really needed?  AFAICS, the error will
> reach the caller, in which case we don't usually need to print a
> kernel error.

I will remove it. I think the code in general has quite a few printk() and
pr_debug() calls we can get rid of. Some of them were helpful for debugging
problems while the code was being developed. But now that the code is
working, we should be able to drop some of them.

[..]
> > +static int
> > +fuse_send_removemapping(struct inode *inode,
> > +   struct fuse_removemapping_in *inargp,
> > +   struct fuse_removemapping_one *remove_one)
> > +{
> > +   struct fuse_inode *fi = get_fuse_inode(inode);
> > +   struct fuse_conn *fc = get_fuse_conn(inode);
> > +   FUSE_ARGS(args);
> > +
> > +   args.opcode = FUSE_REMOVEMAPPING;
> > +   args.node

Re: [PATCH 00/20] virtiofs: Add DAX support

2020-03-11 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 09:32:17PM +0200, Amir Goldstein wrote:
> On Wed, Mar 11, 2020 at 8:48 PM Vivek Goyal  wrote:
> >
> > On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> > > On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal  wrote:
> > > >
> > > > Hi,
> > > >
> > > > This patch series adds DAX support to virtiofs filesystem. This allows
> > > > bypassing guest page cache and allows mapping host page cache directly
> > > > in guest address space.
> > > >
> > > > When a page of file is needed, guest sends a request to map that page
> > > > (in host page cache) in qemu address space. Inside guest this is
> > > > a physical memory range controlled by virtiofs device. And guest
> > > > directly maps this physical address range using DAX and hence gets
> > > > access to file data on host.
> > > >
> > > > This can speed up things considerably in many situations. Also this
> > > > can result in substantial memory savings as file data does not have
> > > > to be copied in guest and it is directly accessed from host page
> > > > cache.
> > > >
> > > > Most of the changes are limited to fuse/virtiofs. There are couple
> > > > of changes needed in generic dax infrastructure and couple of changes
> > > > in virtio to be able to access shared memory region.
> > > >
> > > > These patches apply on top of 5.6-rc4 and are also available here.
> > > >
> > > > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> > > >
> > > > Any review or feedback is welcome.
> > > >
> > > [...]
> > > >  drivers/dax/super.c|3 +-
> > > >  drivers/virtio/virtio_mmio.c   |   32 +
> > > >  drivers/virtio/virtio_pci_modern.c |  107 +++
> > > >  fs/dax.c   |   66 +-
> > > >  fs/fuse/dir.c  |2 +
> > > >  fs/fuse/file.c | 1162 +++-
> > >
> > > That's a big addition to already big file.c.
> > > Maybe split dax specific code to dax.c?
> > > Can be a post series cleanup too.
> >
> > How about fs/fuse/iomap.c instead. This will have all the iomap related 
> > logic
> > as well as all the dax range allocation/free logic which is required
> > by iomap logic. That moves about 900 lines of code from file.c to iomap.c
> >
> 
> Fine by me. I didn't take time to study the code in file.c
> I just noticed is has grown a lot bigger and wasn't sure that
> it made sense. Up to you. Only if you think the result would be nicer
> to maintain.

I am happy to move this code to a separate file. In fact, I think we could
probably break it out further into another file, say dax-mapping.c or
something like that, where all the memory range allocation/reclaim logic
goes while the iomap logic remains in iomap.c.

But that's probably a future cleanup, if the code in this file continues to grow.

Vivek


Re: [PATCH 04/20] virtio: Implement get_shm_region for PCI transport

2020-03-11 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 05:34:05PM +, Stefan Hajnoczi wrote:
> On Tue, Mar 10, 2020 at 02:19:36PM -0400, Vivek Goyal wrote:
> > On Tue, Mar 10, 2020 at 11:04:37AM +, Stefan Hajnoczi wrote:
> > > On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> > > > diff --git a/drivers/virtio/virtio_pci_modern.c 
> > > > b/drivers/virtio/virtio_pci_modern.c
> > > > index 7abcc50838b8..52f179411015 100644
> > > > --- a/drivers/virtio/virtio_pci_modern.c
> > > > +++ b/drivers/virtio/virtio_pci_modern.c
> > > > @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info 
> > > > *info)
> > > > vring_del_virtqueue(vq);
> > > >  }
> > > >  
> > > > +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> > > > +   u8 required_id,
> > > > +   u8 *bar, u64 *offset, u64 *len)
> > > > +{
> > > > +   int pos;
> > > > +
> > > > +for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> > > 
> > > Please fix the mixed tabs vs space indentation in this patch.
> > 
> > Will do. There are plenty of these in this patch.
> > 
> > > 
> > > > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > > > + struct virtio_shm_region *region, u8 id)
> > > > +{
> > > > +   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > > +   struct pci_dev *pci_dev = vp_dev->pci_dev;
> > > > +   u8 bar;
> > > > +   u64 offset, len;
> > > > +   phys_addr_t phys_addr;
> > > > +   size_t bar_len;
> > > > +   int ret;
> > > > +
> > > > +   if (!virtio_pci_find_shm_cap(pci_dev, id, , , )) 
> > > > {
> > > > +   return false;
> > > > +   }
> > > > +
> > > > +   ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > > > +   if (ret < 0) {
> > > > +   dev_err(_dev->dev, "%s: failed to request BAR\n",
> > > > +   __func__);
> > > > +   return false;
> > > > +   }
> > > > +
> > > > +   phys_addr = pci_resource_start(pci_dev, bar);
> > > > +   bar_len = pci_resource_len(pci_dev, bar);
> > > > +
> > > > +if (offset + len > bar_len) {
> > > > +dev_err(_dev->dev,
> > > > +"%s: bar shorter than cap offset+len\n",
> > > > +__func__);
> > > > +return false;
> > > > +}
> > > > +
> > > > +   region->len = len;
> > > > +   region->addr = (u64) phys_addr + offset;
> > > > +
> > > > +   return true;
> > > > +}
> > > 
> > > Missing pci_release_region()?
> > 
> > Good catch. We don't have a mechanism to call pci_relese_region() and 
> > virtio-mmio device's ->get_shm_region() implementation does not even
> > seem to reserve the resources.
> > 
> > So how about we leave this resource reservation to the caller.
> > ->get_shm_region() just returns the addr/len pair of requested resource.
> > 
> > Something like this patch.
> > 
> > ---
> >  drivers/virtio/virtio_pci_modern.c |8 
> >  fs/fuse/virtio_fs.c|   13 ++---
> >  2 files changed, 10 insertions(+), 11 deletions(-)
> > 
> > Index: redhat-linux/fs/fuse/virtio_fs.c
> > ===
> > --- redhat-linux.orig/fs/fuse/virtio_fs.c   2020-03-10 09:13:34.624565666 
> > -0400
> > +++ redhat-linux/fs/fuse/virtio_fs.c2020-03-10 14:11:10.970284651 
> > -0400
> > @@ -763,11 +763,18 @@ static int virtio_fs_setup_dax(struct vi
> > if (!have_cache) {
> > dev_notice(>dev, "%s: No cache capability\n", __func__);
> > return 0;
> > -   } else {
> > -   dev_notice(>dev, "Cache len: 0x%llx @ 0x%llx\n",
> > -  cache_reg.len, cache_reg.addr);
> > }
> >  
> > +   if (!devm_request_mem_region(>dev, cache_reg.addr, cache_reg.len,
> > +dev_name(>dev))) {
> > +   d

Re: [PATCH 00/20] virtiofs: Add DAX support

2020-03-11 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal  wrote:
> >
> > Hi,
> >
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> >
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> >
> > These patches apply on top of 5.6-rc4 and are also available here.
> >
> > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> >
> > Any review or feedback is welcome.
> >
> [...]
> >  drivers/dax/super.c|3 +-
> >  drivers/virtio/virtio_mmio.c   |   32 +
> >  drivers/virtio/virtio_pci_modern.c |  107 +++
> >  fs/dax.c   |   66 +-
> >  fs/fuse/dir.c  |2 +
> >  fs/fuse/file.c | 1162 +++-
> 
> That's a big addition to already big file.c.
> Maybe split dax specific code to dax.c?
> Can be a post series cleanup too.

How about fs/fuse/iomap.c instead? This will have all the iomap-related logic
as well as all the dax range allocation/free logic which is required
by the iomap logic. That moves about 900 lines of code from file.c to iomap.c.

Vivek


Re: [PATCH 12/20] fuse: Introduce setupmapping/removemapping commands

2020-03-11 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 03:19:18PM +0100, Miklos Szeredi wrote:
> On Wed, Mar 11, 2020 at 8:03 AM Amir Goldstein  wrote:
> >
> > On Tue, Mar 10, 2020 at 10:34 PM Vivek Goyal  wrote:
> > >
> > > On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> > > > On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal  wrote:
> > > > >
> > > > > Introduce two new fuse commands to setup/remove memory mappings. This
> > > > > will be used to setup/tear down file mapping in dax window.
> > > > >
> > > > > Signed-off-by: Vivek Goyal 
> > > > > Signed-off-by: Peng Tao 
> > > > > ---
> > > > >  include/uapi/linux/fuse.h | 37 +
> > > > >  1 file changed, 37 insertions(+)
> > > > >
> > > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > > index 5b85819e045f..62633555d547 100644
> > > > > --- a/include/uapi/linux/fuse.h
> > > > > +++ b/include/uapi/linux/fuse.h
> > > > > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > > > > uint64_tflags;
> > > > >  };
> > > > >
> > > > > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > > > > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > > > > +struct fuse_setupmapping_in {
> > > > > +   /* An already open handle */
> > > > > +   uint64_tfh;
> > > > > +   /* Offset into the file to start the mapping */
> > > > > +   uint64_tfoffset;
> > > > > +   /* Length of mapping required */
> > > > > +   uint64_tlen;
> > > > > +   /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > > > > +   uint64_tflags;
> > > > > +   /* Offset in Memory Window */
> > > > > +   uint64_tmoffset;
> > > > > +};
> > > > > +
> > > > > +struct fuse_setupmapping_out {
> > > > > +   /* Offsets into the cache of mappings */
> > > > > +   uint64_tcoffset[FUSE_SETUPMAPPING_ENTRIES];
> > > > > +/* Lengths of each mapping */
> > > > > +uint64_t   len[FUSE_SETUPMAPPING_ENTRIES];
> > > > > +};
> > > >
> > > > fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to 
> > > > be unused.
> > >
> > > This looks like leftover from the old code. I will get rid of it. Thanks.
> > >
> >
> > Hmm. I wonder if we should keep some out args for future extensions.
> > Maybe return the mapped size even though it is all or nothing at this
> > point?
> >
> > I have interest in a similar FUSE mapping functionality that was prototyped
> > by Miklos and published here:
> > https://lore.kernel.org/linux-fsdevel/cajfpegtjeoe7h8taylaqhg9frsbivuaspnmpr2oqiozxvb1...@mail.gmail.com/
> >
> > In this prototype, a FUSE_MAP command is used by the server to map a
> > range of file to the kernel for io. The command in args are quite similar to
> > those in fuse_setupmapping_in, but since the server is on the same host,
> > the mapping response is {mapfd, offset, size}.
> 
> Right.  So the difference is in which entity allocates the mapping.
> IOW whether the {fd, offset, size} is input or output in the protocol.
> 
> I don't remember the reasons for going with the mapping being
> allocated by the client, not the other way round.   Vivek?

I think one of the main reasons is memory reclaim. Once all ranges in
the cache are allocated, we need to free a memory range which can be
reused. And the client has all the logic to free up that range so that it can
be remapped and reused for a different file/offset. The server will not know
any of this. So I would think that for virtiofs, the server is not in a
position to decide where to map a section of a file, and it has to be told
explicitly by the client.
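
To illustrate (simplified from fuse_setup_one_mapping() in the posted
series; get_free_or_reclaimed_range() is a made-up helper name standing in
for the client-side allocation/reclaim policy):

	/* Client picks which slot of the dax window to (re)use and tells
	 * the server via moffset. */
	dmap = get_free_or_reclaimed_range(fc);
	inarg.foffset = round_down(pos, FUSE_DAX_MEM_RANGE_SZ);
	inarg.moffset = dmap->window_offset;	/* chosen by the client */
	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
	inarg.flags = FUSE_SETUPMAPPING_FLAG_READ;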

> 
> If the allocation were to be by the server, we could share the request
> type and possibly some code between the two, although the I/O
> mechanism would still be different.
> 

So the input parameters of FUSE_SETUPMAPPING and FUSE_MAP seem
similar (except the moffset field). Given that the output of a FUSE_MAP
request is very different, I would think it will be easier to have it as a
separate command.

Or could it be some sort of optional output arg which differentiates
between the two types of requests?

/me personally finds it simpler to have a separate command instead of
overloading FUSE_SETUPMAPPING. But it's your call. :-)

Vivek


Re: [PATCH 00/20] virtiofs: Add DAX support

2020-03-11 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal  wrote:
> >
> > Hi,
> >
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> >
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> >
> > These patches apply on top of 5.6-rc4 and are also available here.
> >
> > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> >
> > Any review or feedback is welcome.
> >
> [...]
> >  drivers/dax/super.c|3 +-
> >  drivers/virtio/virtio_mmio.c   |   32 +
> >  drivers/virtio/virtio_pci_modern.c |  107 +++
> >  fs/dax.c   |   66 +-
> >  fs/fuse/dir.c  |2 +
> >  fs/fuse/file.c | 1162 +++-
> 
> That's a big addition to already big file.c.
> Maybe split dax specific code to dax.c?
> Can be a post series cleanup too.

A lot of this code comes from the logic to reclaim dax memory ranges
assigned to an inode. I will look into moving some of it to a separate
file.

Vivek


Re: [PATCH 20/20] fuse,virtiofs: Add logic to free up a memory range

2020-03-11 Thread Vivek Goyal
On Wed, Mar 11, 2020 at 01:16:42PM +0800, Liu Bo wrote:

[..]
> > @@ -719,6 +723,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > if (refcount_dec_and_test(>count)) {
> > struct fuse_iqueue *fiq = >iq;
> >  
> > +   flush_delayed_work(>dax_free_work);
> 
> Today while debugging another case, I realized that flushing work here
> at the very last fuse_conn_put() is a bit too late, here's my analysis,
> 
>  umount   kthread
> 
> deactivate_locked_super
>   ->virtio_kill_sb
> try_to_free_dmap_chunks
> ->generic_shutdown_super->igrab()
> ...
>  ->evict_inodes()  -> check all inodes' count
>  ->fuse_conn_put->iput
>  ->virtio_fs_free_devs
>->fuse_dev_free
>  ->fuse_conn_put // vq1
>->fuse_dev_free
>  ->fuse_conn_put // vq2
>->flush_delayed_work
> 
> The above can end up with a warning message reported by evict_inodes()
> about stable inodes.

Hi Liu Bo,

Which warning is that? Can you point me to it in the code?

> So I think it's necessary to put either
> cancel_delayed_work_sync() or flush_delayed_work() before going to
> generic_shutdown_super().

In general I agree that shutting down the memory-range-freeing worker
early in the unmount/shutdown sequence makes sense. It does not seem
helpful to let it run while the filesystem is going away. How about the
following patch?

---
 fs/fuse/inode.c |1 -
 fs/fuse/virtio_fs.c |5 +
 2 files changed, 5 insertions(+), 1 deletion(-)

Index: redhat-linux/fs/fuse/virtio_fs.c
===
--- redhat-linux.orig/fs/fuse/virtio_fs.c   2020-03-10 14:11:10.970284651 
-0400
+++ redhat-linux/fs/fuse/virtio_fs.c2020-03-11 08:27:08.103330039 -0400
@@ -1295,6 +1295,11 @@ static void virtio_kill_sb(struct super_
vfs = fc->iq.priv;
fsvq = >vqs[VQ_HIPRIO];
 
+   /* Stop dax worker. Soon evict_inodes() will be called which will
+* free all memory ranges belonging to all inodes.
+*/
+   flush_delayed_work(>dax_free_work);
+
/* Stop forget queue. Soon destroy will be sent */
spin_lock(>lock);
fsvq->connected = false;
Index: redhat-linux/fs/fuse/inode.c
===
--- redhat-linux.orig/fs/fuse/inode.c   2020-03-10 09:13:35.132565666 -0400
+++ redhat-linux/fs/fuse/inode.c2020-03-11 08:22:02.685330039 -0400
@@ -723,7 +723,6 @@ void fuse_conn_put(struct fuse_conn *fc)
if (refcount_dec_and_test(>count)) {
struct fuse_iqueue *fiq = >iq;
 
-   flush_delayed_work(>dax_free_work);
if (fc->dax_dev)
fuse_free_dax_mem_ranges(>free_ranges);
if (fiq->ops->release)


Re: [PATCH 12/20] fuse: Introduce setupmapping/removemapping commands

2020-03-10 Thread Vivek Goyal
On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal  wrote:
> >
> > Introduce two new fuse commands to setup/remove memory mappings. This
> > will be used to setup/tear down file mapping in dax window.
> >
> > Signed-off-by: Vivek Goyal 
> > Signed-off-by: Peng Tao 
> > ---
> >  include/uapi/linux/fuse.h | 37 +
> >  1 file changed, 37 insertions(+)
> >
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 5b85819e045f..62633555d547 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > uint64_tflags;
> >  };
> >
> > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > +struct fuse_setupmapping_in {
> > +   /* An already open handle */
> > +   uint64_tfh;
> > +   /* Offset into the file to start the mapping */
> > +   uint64_tfoffset;
> > +   /* Length of mapping required */
> > +   uint64_tlen;
> > +   /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > +   uint64_tflags;
> > +   /* Offset in Memory Window */
> > +   uint64_tmoffset;
> > +};
> > +
> > +struct fuse_setupmapping_out {
> > +   /* Offsets into the cache of mappings */
> > +   uint64_tcoffset[FUSE_SETUPMAPPING_ENTRIES];
> > +/* Lengths of each mapping */
> > +uint64_t   len[FUSE_SETUPMAPPING_ENTRIES];
> > +};
> 
> fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be 
> unused.

This looks like a leftover from the old code. I will get rid of it. Thanks.

Vivek


Re: [PATCH 02/20] dax: Create a range version of dax_layout_busy_page()

2020-03-10 Thread Vivek Goyal
On Tue, Mar 10, 2020 at 08:19:07AM -0700, Ira Weiny wrote:
> On Wed, Mar 04, 2020 at 11:58:27AM -0500, Vivek Goyal wrote:
> >  
> > +   /* If end == 0, all pages from start to till end of file */
> > +   if (!end) {
> > +   end_idx = ULONG_MAX;
> > +   len = 0;
> 
> I find this a bit odd to specify end == 0 for ULONG_MAX...
> 
> >  }
> > +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> > +
> > +/**
> > + * dax_layout_busy_page - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> > + * any page in the mapping is busy, i.e. for DMA, or other
> > + * get_user_pages() usages.
> > + *
> > + * It is expected that the filesystem is holding locks to block the
> > + * establishment of new mappings in this address_space. I.e. it expects
> > + * to be able to run unmap_mapping_range() and subsequently not race
> > + * mapping_mapped() becoming true.
> > + */
> > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > +{
> > +   return dax_layout_busy_page_range(mapping, 0, 0);
> 
> ... other functions I have seen specify ULONG_MAX here.  Which IMO makes this
> call site more clear.

I think I looked at unmap_mapping_range(), where holelen=0 implies "till
the end of file", and followed the same pattern.

But I agree that LLONG_MAX (end is of type loff_t) is probably more
intuitive. I will change it.
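
Something like this (untested), making the "whole file" case explicit at
the call site instead of relying on the end == 0 convention:

/* Sketch only: scan the whole file by passing LLONG_MAX explicitly */
struct page *dax_layout_busy_page(struct address_space *mapping)
{
        return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
}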

Vivek


Re: [PATCH 04/20] virtio: Implement get_shm_region for PCI transport

2020-03-10 Thread Vivek Goyal
On Tue, Mar 10, 2020 at 07:12:25AM -0400, Michael S. Tsirkin wrote:
[..]
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > + struct virtio_shm_region *region, u8 id)
> > +{
> > +   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > +   struct pci_dev *pci_dev = vp_dev->pci_dev;
> > +   u8 bar;
> > +   u64 offset, len;
> > +   phys_addr_t phys_addr;
> > +   size_t bar_len;
> > +   int ret;
> > +
> > +   if (!virtio_pci_find_shm_cap(pci_dev, id, , , )) {
> > +   return false;
> > +   }
> > +
> > +   ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > +   if (ret < 0) {
> > +   dev_err(_dev->dev, "%s: failed to request BAR\n",
> > +   __func__);
> > +   return false;
> > +   }
> > +
> > +   phys_addr = pci_resource_start(pci_dev, bar);
> > +   bar_len = pci_resource_len(pci_dev, bar);
> > +
> > +if (offset + len > bar_len) {
> > +dev_err(_dev->dev,
> > +"%s: bar shorter than cap offset+len\n",
> > +__func__);
> > +return false;
> > +}
> > +
> 
> Something wrong with indentation here.

Will fix all indentation-related issues in this patch.

> Also as long as you are validating things, it's worth checking
> offset + len does not overflow.

Something like the addition of the following lines?

+	if ((offset + len) < offset) {
+		dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
+			__func__);
+		return false;
+	}

Vivek


Re: [PATCH 04/20] virtio: Implement get_shm_region for PCI transport

2020-03-10 Thread Vivek Goyal
On Tue, Mar 10, 2020 at 11:04:37AM +, Stefan Hajnoczi wrote:
> On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> > diff --git a/drivers/virtio/virtio_pci_modern.c 
> > b/drivers/virtio/virtio_pci_modern.c
> > index 7abcc50838b8..52f179411015 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
> > vring_del_virtqueue(vq);
> >  }
> >  
> > +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> > +   u8 required_id,
> > +   u8 *bar, u64 *offset, u64 *len)
> > +{
> > +   int pos;
> > +
> > +for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> 
> Please fix the mixed tabs vs space indentation in this patch.

Will do. There are plenty of these in this patch.

> 
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > + struct virtio_shm_region *region, u8 id)
> > +{
> > +   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > +   struct pci_dev *pci_dev = vp_dev->pci_dev;
> > +   u8 bar;
> > +   u64 offset, len;
> > +   phys_addr_t phys_addr;
> > +   size_t bar_len;
> > +   int ret;
> > +
> > +   if (!virtio_pci_find_shm_cap(pci_dev, id, , , )) {
> > +   return false;
> > +   }
> > +
> > +   ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > +   if (ret < 0) {
> > +   dev_err(_dev->dev, "%s: failed to request BAR\n",
> > +   __func__);
> > +   return false;
> > +   }
> > +
> > +   phys_addr = pci_resource_start(pci_dev, bar);
> > +   bar_len = pci_resource_len(pci_dev, bar);
> > +
> > +if (offset + len > bar_len) {
> > +dev_err(_dev->dev,
> > +"%s: bar shorter than cap offset+len\n",
> > +__func__);
> > +return false;
> > +}
> > +
> > +   region->len = len;
> > +   region->addr = (u64) phys_addr + offset;
> > +
> > +   return true;
> > +}
> 
> Missing pci_release_region()?

Good catch. We don't have a mechanism to call pci_release_region(), and
the virtio-mmio device's ->get_shm_region() implementation does not even
seem to reserve the resources.

So how about we leave this resource reservation to the caller?
->get_shm_region() just returns the addr/len pair of the requested resource.

Something like this patch.

---
 drivers/virtio/virtio_pci_modern.c |8 
 fs/fuse/virtio_fs.c|   13 ++---
 2 files changed, 10 insertions(+), 11 deletions(-)

Index: redhat-linux/fs/fuse/virtio_fs.c
===
--- redhat-linux.orig/fs/fuse/virtio_fs.c   2020-03-10 09:13:34.624565666 
-0400
+++ redhat-linux/fs/fuse/virtio_fs.c2020-03-10 14:11:10.970284651 -0400
@@ -763,11 +763,18 @@ static int virtio_fs_setup_dax(struct vi
if (!have_cache) {
dev_notice(>dev, "%s: No cache capability\n", __func__);
return 0;
-   } else {
-   dev_notice(>dev, "Cache len: 0x%llx @ 0x%llx\n",
-  cache_reg.len, cache_reg.addr);
}
 
+   if (!devm_request_mem_region(>dev, cache_reg.addr, cache_reg.len,
+dev_name(>dev))) {
+   dev_warn(>dev, "could not reserve region addr=0x%llx"
+" len=0x%llx\n", cache_reg.addr, cache_reg.len);
+   return -EBUSY;
+}
+
+   dev_notice(>dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
+  cache_reg.addr);
+
pgmap = devm_kzalloc(>dev, sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
return -ENOMEM;
Index: redhat-linux/drivers/virtio/virtio_pci_modern.c
===
--- redhat-linux.orig/drivers/virtio/virtio_pci_modern.c2020-03-10 
08:51:36.886565666 -0400
+++ redhat-linux/drivers/virtio/virtio_pci_modern.c 2020-03-10 
13:43:15.168753543 -0400
@@ -511,19 +511,11 @@ static bool vp_get_shm_region(struct vir
u64 offset, len;
phys_addr_t phys_addr;
size_t bar_len;
-   int ret;
 
if (!virtio_pci_find_shm_cap(pci_dev, id, , , )) {
return false;
}
 
-   ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
-   if (ret < 0) {
-   dev_err(_dev->dev, "%s: failed to request BAR\n",
-   __func__);
-   return false;
-   }
-
phys_addr = pci_resource_start(pci_dev, bar);
bar_len = pci_resource_len(pci_dev, bar);
 


Re: [PATCH v6 0/6] dax/pmem: Provide a dax operation to zero page range

2020-03-10 Thread Vivek Goyal
On Fri, Feb 28, 2020 at 11:34:50AM -0500, Vivek Goyal wrote:
> Hi,
> 
> This is V6 of patches. These patches are also available at.

Hi Dan,

Ping. Does this patch series look fine to you?

Vivek

> 
> Changes since V5:
> 
> - Dan Williams preferred ->zero_page_range() to only accept PAGE_SIZE
>   aligned request and clear poison only on page size aligned zeroing. So
>   I changed it accordingly. 
> 
> - Dropped all the modifications which were required to support arbitrary
>   range zeroing with-in a page.
> 
> - This patch series also fixes the issue where "truncate -s 512 foo.txt"
>   will fail if the first sector of the file is poisoned. Currently it
>   succeeds and the filesystem expects the whole filesystem block to be
>   free of poison at the end of the operation.
> 
> Christoph, I have dropped your Reviewed-by tag on 1-2 patches because
> these patches changed substantially, especially the signature of the
> dax zero_page_range() helper.
> 
> Thanks
> Vivek
> 
> Vivek Goyal (6):
>   pmem: Add functions for reading/writing page to/from pmem
>   dax, pmem: Add a dax operation zero_page_range
>   s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
>   dm,dax: Add dax zero_page_range operation
>   dax: Use new dax zero page method for zeroing a page
>   dax,iomap: Add helper dax_iomap_zero() to zero a range
> 
>  drivers/dax/super.c   | 20 
>  drivers/md/dm-linear.c| 18 +++
>  drivers/md/dm-log-writes.c| 17 ++
>  drivers/md/dm-stripe.c| 23 +
>  drivers/md/dm.c   | 30 +++
>  drivers/nvdimm/pmem.c | 97 ++-
>  drivers/s390/block/dcssblk.c  | 15 ++
>  fs/dax.c  | 59 ++---
>  fs/iomap/buffered-io.c|  9 +---
>  include/linux/dax.h   | 21 +++-
>  include/linux/device-mapper.h |  3 ++
>  11 files changed, 221 insertions(+), 91 deletions(-)
> 
> -- 
> 2.20.1
> 


Re: [PATCH v6 1/6] pmem: Add functions for reading/writing page to/from pmem

2020-03-04 Thread Vivek Goyal
On Sat, Feb 29, 2020 at 09:04:00AM +0100, Pankaj Gupta wrote:
> On Fri, 28 Feb 2020 at 17:35, Vivek Goyal  wrote:
> >
> > This splits pmem_do_bvec() into pmem_do_read() and pmem_do_write().
> > pmem_do_write() will be used by pmem zero_page_range() as well. Hence
> > sharing the same code.
> >
> > Suggested-by: Christoph Hellwig 
> > Reviewed-by: Christoph Hellwig 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  drivers/nvdimm/pmem.c | 86 +--
> >  1 file changed, 50 insertions(+), 36 deletions(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 4eae441f86c9..075b11682192 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -136,9 +136,25 @@ static blk_status_t read_pmem(struct page *page, 
> > unsigned int off,
> > return BLK_STS_OK;
> >  }
> >
> > -static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page 
> > *page,
> > -   unsigned int len, unsigned int off, unsigned int op,
> > -   sector_t sector)
> > +static blk_status_t pmem_do_read(struct pmem_device *pmem,
> > +   struct page *page, unsigned int page_off,
> > +   sector_t sector, unsigned int len)
> > +{
> > +   blk_status_t rc;
> > +   phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> 
> minor nit, maybe 512 could be replaced by a macro? Looks like it's used
> at multiple places, maybe we can keep it as is for now.

This came from existing code. If I end up spinning this patch series
again, I will replace it with (sector << SECTOR_SHIFT).

Thanks
Vivek


[PATCH 06/20] virtiofs: Provide a helper function for virtqueue initialization

2020-03-04 Thread Vivek Goyal
This reduces code duplication and makes the code a little easier to read.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/virtio_fs.c | 50 +++--
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index bade74768903..a16cc9195087 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -24,6 +24,8 @@ enum {
VQ_REQUEST
 };
 
+#define VQ_NAME_LEN24
+
 /* Per-virtqueue state */
 struct virtio_fs_vq {
spinlock_t lock;
@@ -36,7 +38,7 @@ struct virtio_fs_vq {
bool connected;
long in_flight;
struct completion in_flight_zero; /* No inflight requests */
-   char name[24];
+   char name[VQ_NAME_LEN];
 } cacheline_aligned_in_smp;
 
 /* A virtio-fs device instance */
@@ -560,6 +562,26 @@ static void virtio_fs_vq_done(struct virtqueue *vq)
schedule_work(>done_work);
 }
 
+static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
+ int vq_type)
+{
+   strncpy(fsvq->name, name, VQ_NAME_LEN);
+   spin_lock_init(>lock);
+   INIT_LIST_HEAD(>queued_reqs);
+   INIT_LIST_HEAD(>end_reqs);
+   init_completion(>in_flight_zero);
+
+   if (vq_type == VQ_REQUEST) {
+   INIT_WORK(>done_work, virtio_fs_requests_done_work);
+   INIT_DELAYED_WORK(>dispatch_work,
+ virtio_fs_request_dispatch_work);
+   } else {
+   INIT_WORK(>done_work, virtio_fs_hiprio_done_work);
+   INIT_DELAYED_WORK(>dispatch_work,
+ virtio_fs_hiprio_dispatch_work);
+   }
+}
+
 /* Initialize virtqueues */
 static int virtio_fs_setup_vqs(struct virtio_device *vdev,
   struct virtio_fs *fs)
@@ -575,7 +597,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
if (fs->num_request_queues == 0)
return -EINVAL;
 
-   fs->nvqs = 1 + fs->num_request_queues;
+   fs->nvqs = VQ_REQUEST + fs->num_request_queues;
fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
if (!fs->vqs)
return -ENOMEM;
@@ -589,29 +611,17 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
goto out;
}
 
+   /* Initialize the hiprio/forget request virtqueue */
callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
-   snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
-   "hiprio");
+   virtio_fs_init_vq(>vqs[VQ_HIPRIO], "hiprio", VQ_HIPRIO);
names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
-   INIT_WORK(>vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
-   INIT_LIST_HEAD(>vqs[VQ_HIPRIO].queued_reqs);
-   INIT_LIST_HEAD(>vqs[VQ_HIPRIO].end_reqs);
-   INIT_DELAYED_WORK(>vqs[VQ_HIPRIO].dispatch_work,
-   virtio_fs_hiprio_dispatch_work);
-   init_completion(>vqs[VQ_HIPRIO].in_flight_zero);
-   spin_lock_init(>vqs[VQ_HIPRIO].lock);
 
/* Initialize the requests virtqueues */
for (i = VQ_REQUEST; i < fs->nvqs; i++) {
-   spin_lock_init(>vqs[i].lock);
-   INIT_WORK(>vqs[i].done_work, virtio_fs_requests_done_work);
-   INIT_DELAYED_WORK(>vqs[i].dispatch_work,
- virtio_fs_request_dispatch_work);
-   INIT_LIST_HEAD(>vqs[i].queued_reqs);
-   INIT_LIST_HEAD(>vqs[i].end_reqs);
-   init_completion(>vqs[i].in_flight_zero);
-   snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
-"requests.%u", i - VQ_REQUEST);
+   char vq_name[VQ_NAME_LEN];
+
+   snprintf(vq_name, VQ_NAME_LEN, "requests.%u", i - VQ_REQUEST);
+   virtio_fs_init_vq(>vqs[i], vq_name, VQ_REQUEST);
callbacks[i] = virtio_fs_vq_done;
names[i] = fs->vqs[i].name;
}
-- 
2.20.1


[PATCH 12/20] fuse: Introduce setupmapping/removemapping commands

2020-03-04 Thread Vivek Goyal
Introduce two new fuse commands to set up/remove memory mappings. These
will be used to set up/tear down file mappings in the dax window.

Signed-off-by: Vivek Goyal 
Signed-off-by: Peng Tao 
---
 include/uapi/linux/fuse.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5b85819e045f..62633555d547 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
uint64_tflags;
 };
 
+#define FUSE_SETUPMAPPING_ENTRIES 8
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+   /* An already open handle */
+   uint64_tfh;
+   /* Offset into the file to start the mapping */
+   uint64_tfoffset;
+   /* Length of mapping required */
+   uint64_tlen;
+   /* Flags, FUSE_SETUPMAPPING_FLAG_* */
+   uint64_tflags;
+   /* Offset in Memory Window */
+   uint64_tmoffset;
+};
+
+struct fuse_setupmapping_out {
+   /* Offsets into the cache of mappings */
+   uint64_tcoffset[FUSE_SETUPMAPPING_ENTRIES];
+/* Lengths of each mapping */
+uint64_t   len[FUSE_SETUPMAPPING_ENTRIES];
+};
+
+struct fuse_removemapping_in {
+   /* number of fuse_removemapping_one follows */
+   uint32_tcount;
+};
+
+struct fuse_removemapping_one {
+   /* Offset into the dax window start the unmapping */
+   uint64_tmoffset;
+/* Length of mapping required */
+uint64_t   len;
+};
+
+#define FUSE_REMOVEMAPPING_MAX_ENTRY   \
+   (PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
 #endif /* _LINUX_FUSE_H */
-- 
2.20.1


[PATCH 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault

2020-03-04 Thread Vivek Goyal
We need some kind of locking mechanism here. Normal file systems like
ext4 and xfs seem to take their own semaphore to protect against
truncate while a fault is in progress.

We have an additional requirement to protect against fuse dax memory
range reclaim. When a range has been selected for reclaim, we need to
make sure no other read/write/fault can try to access that memory range
while reclaim is in progress. Once reclaim is complete, the lock will be
released and read/write/fault will trigger allocation of a fresh dax
range.

Taking inode_lock() is not an option in the fault path as lockdep
complains about circular dependencies. So define a new
fuse_inode->i_mmap_sem.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/dir.c|  2 ++
 fs/fuse/file.c   | 15 ---
 fs/fuse/fuse_i.h |  7 +++
 fs/fuse/inode.c  |  1 +
 4 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index de1e2fde60bd..ad699a60ec03 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr 
*attr,
 */
if ((is_truncate || !is_wb) &&
S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+   down_write(>i_mmap_sem);
truncate_pagecache(inode, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
+   up_write(>i_mmap_sem);
}
 
clear_bit(FUSE_I_SIZE_UNSTABLE, >state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 303496e6617f..ab56396cf661 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2907,11 +2907,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
 
if (write)
sb_start_pagefault(sb);
-
+   /*
+* We need to serialize against not only truncate but also against
+* fuse dax memory range reclaim. While a range is being reclaimed,
+* we do not want any read/write/mmap to make progress and try
+* to populate page cache or access memory we are trying to free.
+*/
+   down_read(_fuse_inode(inode)->i_mmap_sem);
ret = dax_iomap_fault(vmf, pe_size, , NULL, _iomap_ops);
 
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+   up_read(_fuse_inode(inode)->i_mmap_sem);
 
if (write)
sb_end_pagefault(sb);
@@ -3869,9 +3876,11 @@ static long fuse_file_fallocate(struct file *file, int 
mode, loff_t offset,
file_update_time(file);
}
 
-   if (mode & FALLOC_FL_PUNCH_HOLE)
+   if (mode & FALLOC_FL_PUNCH_HOLE) {
+   down_write(>i_mmap_sem);
truncate_pagecache_range(inode, offset, offset + length - 1);
-
+   up_write(>i_mmap_sem);
+   }
fuse_invalidate_attr(inode);
 
 out:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 490549862bda..3fea84411401 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -186,6 +186,13 @@ struct fuse_inode {
 */
struct rw_semaphore i_dmap_sem;
 
+   /**
+* Can't take inode lock in fault path (leads to circular dependency).
+* So take this in fuse dax fault path to make sure truncate and
+* punch hole etc. can't make progress in parallel.
+*/
+   struct rw_semaphore i_mmap_sem;
+
/** Sorted rb tree of struct fuse_dax_mapping elements */
struct rb_root_cached dmap_tree;
unsigned long nr_dmaps;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 93bc65607a15..abc881e6acb0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -88,6 +88,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->state = 0;
fi->nr_dmaps = 0;
mutex_init(>mutex);
+   init_rwsem(>i_mmap_sem);
init_rwsem(>i_dmap_sem);
spin_lock_init(>lock);
fi->forget = fuse_alloc_forget();
-- 
2.20.1


[PATCH 05/20] virtio: Implement get_shm_region for MMIO transport

2020-03-04 Thread Vivek Goyal
From: Sebastien Boeuf 

On MMIO a new set of registers is defined for finding SHM
regions.  Add their definitions and use them to find the region.

Signed-off-by: Sebastien Boeuf 
---
 drivers/virtio/virtio_mmio.c | 32 
 include/uapi/linux/virtio_mmio.h | 11 +++
 2 files changed, 43 insertions(+)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 97d5725fd9a2..4922a1a9e3a7 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -500,6 +500,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
return vm_dev->pdev->name;
 }
 
+static bool vm_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+   u64 len, addr;
+
+   /* Select the region we're interested in */
+   writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+   /* Read the region size */
+   len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+   len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+   region->len = len;
+
+   /* Check if region length is -1. If that's the case, the shared memory
+* region does not exist and there is no need to proceed further.
+*/
+   if (len == ~(u64)0) {
+   return false;
+   }
+
+   /* Read the region base address */
+   addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+   addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+   region->addr = addr;
+
+   return true;
+}
+
 static const struct virtio_config_ops virtio_mmio_config_ops = {
.get= vm_get,
.set= vm_set,
@@ -512,6 +543,7 @@ static const struct virtio_config_ops 
virtio_mmio_config_ops = {
.get_features   = vm_get_features,
.finalize_features = vm_finalize_features,
.bus_name   = vm_bus_name,
+   .get_shm_region = vm_get_shm_region,
 };
 
 
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL 0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW 0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH   0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION  0x0fc
 
-- 
2.20.1


[PATCH 07/20] fuse: Get rid of no_mount_options

2020-03-04 Thread Vivek Goyal
This option was introduced so that for virtio_fs we don't show any
mount options in fuse_show_options(), because we don't offer any of
these options to be controlled by the mounter.

Very soon we are planning to introduce the option "dax", which the
mounter should be able to specify, and no_mount_options does not work
anymore. What we need is a per-mount-option flag so that the filesystem
can specify which options to show.

Add a few such flags to control the behavior in a more fine grained
manner and get rid of no_mount_options.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h| 14 ++
 fs/fuse/inode.c | 22 ++
 fs/fuse/virtio_fs.c |  1 -
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index aa75e2305b75..2cebdf6dcfd8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -468,18 +468,21 @@ struct fuse_fs_context {
int fd;
unsigned int rootmode;
kuid_t user_id;
+   bool user_id_show;
kgid_t group_id;
+   bool group_id_show;
bool is_bdev:1;
bool fd_present:1;
bool rootmode_present:1;
bool user_id_present:1;
bool group_id_present:1;
bool default_permissions:1;
+   bool default_permissions_show:1;
bool allow_other:1;
+   bool allow_other_show:1;
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
-   bool no_mount_options:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
@@ -509,9 +512,11 @@ struct fuse_conn {
 
/** The user id for this mount */
kuid_t user_id;
+   bool user_id_show:1;
 
/** The group id for this mount */
kgid_t group_id;
+   bool group_id_show:1;
 
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
@@ -695,10 +700,14 @@ struct fuse_conn {
 
/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
+   bool default_permissions_show:1;
 
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;
 
+   /** Show allow_other in mount options */
+   bool allow_other_show:1;
+
/** Does the filesystem support copy_file_range? */
unsigned no_copy_file_range:1;
 
@@ -714,9 +723,6 @@ struct fuse_conn {
/** Do not allow MNT_FORCE umount */
unsigned int no_force_umount:1;
 
-   /* Do not show mount options */
-   unsigned int no_mount_options:1;
-
/** The number of requests waiting for completion */
atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 95d712d44ca1..f160a3d47b63 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -515,10 +515,12 @@ static int fuse_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
 
case OPT_DEFAULT_PERMISSIONS:
ctx->default_permissions = true;
+   ctx->default_permissions_show = true;
break;
 
case OPT_ALLOW_OTHER:
ctx->allow_other = true;
+   ctx->allow_other_show = true;
break;
 
case OPT_MAX_READ:
@@ -553,14 +555,15 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-   if (fc->no_mount_options)
-   return 0;
-
-   seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, 
fc->user_id));
-   seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, 
fc->group_id));
-   if (fc->default_permissions)
+   if (fc->user_id_show)
+   seq_printf(m, ",user_id=%u",
+  from_kuid_munged(fc->user_ns, fc->user_id));
+   if (fc->group_id_show)
+   seq_printf(m, ",group_id=%u",
+  from_kgid_munged(fc->user_ns, fc->group_id));
+   if (fc->default_permissions && fc->default_permissions_show)
seq_puts(m, ",default_permissions");
-   if (fc->allow_other)
+   if (fc->allow_other && fc->allow_other_show)
seq_puts(m, ",allow_other");
if (fc->max_read != ~0)
seq_printf(m, ",max_read=%u", fc->max_read);
@@ -1171,14 +1174,17 @@ int fuse_fill_super_common(struct super_block *sb, 
struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
 
fc->default_permissions = ctx->default_permissions;
+   fc->default_permissions_show = ctx->default_permissions_show;
fc->allow_other = ctx->allow_other;
+   fc->allow_other_show = ctx->allow_other_show;
fc->user_id = ctx->user_id;
+   fc->user_id_show = ctx->user_id_show

[PATCH 09/20] virtio_fs, dax: Set up virtio_fs dax_device

2020-03-04 Thread Vivek Goyal
From: Stefan Hajnoczi 

Set up a dax device.

Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86).  Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
Signed-off-by: Sebastien Boeuf 
Signed-off-by: Liu Bo 
---
 fs/fuse/virtio_fs.c| 115 +
 include/uapi/linux/virtio_fs.h |   3 +
 2 files changed, 118 insertions(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 62cdd6817b5b..b0574b208cd5 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,9 @@
  */
 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -50,6 +53,12 @@ struct virtio_fs {
struct virtio_fs_vq *vqs;
unsigned int nvqs;   /* number of virtqueues */
unsigned int num_request_queues; /* number of request queues */
+   struct dax_device *dax_dev;
+
+   /* DAX memory window where file contents are mapped */
+   void *window_kaddr;
+   phys_addr_t window_phys_addr;
+   size_t window_len;
 };
 
 struct virtio_fs_forget_req {
@@ -690,6 +699,108 @@ static void virtio_fs_cleanup_vqs(struct virtio_device 
*vdev,
vdev->config->del_vqs(vdev);
 }
 
+/* Map a window offset to a page frame number.  The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct virtio_fs *fs = dax_get_private(dax_dev);
+   phys_addr_t offset = PFN_PHYS(pgoff);
+   size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+   if (kaddr)
+   *kaddr = fs->window_kaddr + offset;
+   if (pfn)
+   *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+   PFN_DEV | PFN_MAP);
+   return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+  pgoff_t pgoff, void *addr,
+  size_t bytes, struct iov_iter *i)
+{
+   return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+  pgoff_t pgoff, void *addr,
+  size_t bytes, struct iov_iter *i)
+{
+   return copy_to_iter(addr, bytes, i);
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+   .direct_access = virtio_fs_direct_access,
+   .copy_from_iter = virtio_fs_copy_from_iter,
+   .copy_to_iter = virtio_fs_copy_to_iter,
+};
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+   struct virtio_fs *fs = data;
+
+   kill_dax(fs->dax_dev);
+   put_dax(fs->dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
+{
+   struct virtio_shm_region cache_reg;
+   struct dev_pagemap *pgmap;
+   bool have_cache;
+
+   if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+   return 0;
+
+   /* Get cache region */
+   have_cache = virtio_get_shm_region(vdev, _reg,
+  (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+   if (!have_cache) {
+   dev_notice(>dev, "%s: No cache capability\n", __func__);
+   return 0;
+   } else {
+   dev_notice(>dev, "Cache len: 0x%llx @ 0x%llx\n",
+  cache_reg.len, cache_reg.addr);
+   }
+
+   pgmap = devm_kzalloc(>dev, sizeof(*pgmap), GFP_KERNEL);
+   if (!pgmap)
+   return -ENOMEM;
+
+   pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+   /* Ideally we would directly use the PCI BAR resource but
+* devm_memremap_pages() wants its own copy in pgmap.  So
+* initialize a struct resource from scratch (only the start
+* and end fields will be used).
+*/
+   pgmap->res = (struct resource){
+   .name = "virtio-fs dax window",
+   .start = (phys_addr_t) cache_reg.addr,
+   .end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
+   };
+
+   fs->window_kaddr = devm_memremap_pages(>dev, pgmap);
+   if (IS_ERR(fs->window_kaddr))
+   return PTR_ERR(fs->window_kaddr);
+
+   fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
+   fs->window_len = (phys_addr_t) cache_reg.len;
+
+   dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
+   " len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
+   cache_reg.le

[PATCH 14/20] fuse,dax: add DAX mmap support

2020-03-04 Thread Vivek Goyal
From: Stefan Hajnoczi 

Add DAX mmap() support.

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/file.c | 62 +-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9effdd3dc6d6..303496e6617f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2870,10 +2870,15 @@ static const struct vm_operations_struct 
fuse_file_vm_ops = {
.page_mkwrite   = fuse_page_mkwrite,
 };
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
struct fuse_file *ff = file->private_data;
 
+   /* DAX mmap is superior to direct_io mmap */
+   if (IS_DAX(file_inode(file)))
+   return fuse_dax_mmap(file, vma);
+
if (ff->open_flags & FOPEN_DIRECT_IO) {
/* Can't provide the coherency needed for MAP_SHARED */
if (vma->vm_flags & VM_MAYSHARE)
@@ -2892,9 +2897,63 @@ static int fuse_file_mmap(struct file *file, struct 
vm_area_struct *vma)
return 0;
 }
 
+static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
+  enum page_entry_size pe_size, bool write)
+{
+   vm_fault_t ret;
+   struct inode *inode = file_inode(vmf->vma->vm_file);
+   struct super_block *sb = inode->i_sb;
+   pfn_t pfn;
+
+   if (write)
+   sb_start_pagefault(sb);
+
+   ret = dax_iomap_fault(vmf, pe_size, , NULL, _iomap_ops);
+
+   if (ret & VM_FAULT_NEEDDSYNC)
+   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+   if (write)
+   sb_end_pagefault(sb);
+
+   return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+   vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+  enum page_entry_size pe_size)
+{
+   return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+   .fault  = fuse_dax_fault,
+   .huge_fault = fuse_dax_huge_fault,
+   .page_mkwrite   = fuse_dax_page_mkwrite,
+   .pfn_mkwrite= fuse_dax_pfn_mkwrite,
+};
+
 static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-   return -EINVAL; /* TODO */
+   file_accessed(file);
+   vma->vm_ops = _dax_vm_ops;
+   vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+   return 0;
 }
 
 static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3940,6 +3999,7 @@ static const struct file_operations fuse_file_operations 
= {
.release= fuse_release,
.fsync  = fuse_fsync,
.lock   = fuse_file_lock,
+   .get_unmapped_area = thp_get_unmapped_area,
.flock  = fuse_file_flock,
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
-- 
2.20.1


[PATCH 03/20] virtio: Add get_shm_region method

2020-03-04 Thread Vivek Goyal
From: Sebastien Boeuf 

Virtio defines 'shared memory regions' that provide a continuously
shared region between the host and guest.

Provide a method to find a particular region on a device.

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
---
 include/linux/virtio_config.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bb4cc4910750..c859f000a751 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -10,6 +10,11 @@
 
 struct irq_affinity;
 
+struct virtio_shm_region {
+   u64 addr;
+   u64 len;
+};
+
 /**
  * virtio_config_ops - operations for configuring a virtio device
  * Note: Do not assume that a transport implements all of the operations
@@ -65,6 +70,7 @@ struct irq_affinity;
  *  the caller can then copy.
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -88,6 +94,8 @@ struct virtio_config_ops {
   const struct cpumask *cpu_mask);
const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
int index);
+   bool (*get_shm_region)(struct virtio_device *vdev,
+  struct virtio_shm_region *region, u8 id);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const 
struct cpumask *cpu_mask)
return 0;
 }
 
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   if (!vdev->config->get_shm_region)
+   return false;
+   return vdev->config->get_shm_region(vdev, region, id);
+}
+
 static inline bool virtio_is_little_endian(struct virtio_device *vdev)
 {
return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
-- 
2.20.1


[PATCH 18/20] fuse: Release file in process context

2020-03-04 Thread Vivek Goyal
fuse_file_put(sync) can be called with sync=true/false. If sync=true,
it waits for the release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for the release
request response, frees the fuse_file struct immediately, and the
req->end function does the iput().

iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has let go of its reference
already), then iput() will clean up DAX mappings as well, send
REMOVEMAPPING requests, and wait for completion. (All of this happens in
the worker thread context which is processing fuse replies from the
daemon on the host.)

That means it blocks the worker thread, which stops processing further
replies, and the system deadlocks.

So for now, force sync release of the file in case of DAX inodes.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index afabeb1acd50..561428b66101 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -543,6 +543,7 @@ void fuse_release_common(struct file *file, bool isdir)
struct fuse_file *ff = file->private_data;
struct fuse_release_args *ra = ff->release_args;
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+   bool sync = false;
 
fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
@@ -562,8 +563,19 @@ void fuse_release_common(struct file *file, bool isdir)
 * Make the release synchronous if this is a fuseblk mount,
 * synchronous RELEASE is allowed (and desirable) in this case
 * because the server can be trusted not to screw up.
+*
+* For DAX, fuse server is trusted. So it should be fine to
+* do a sync file put. Doing async file put is creating
+* problems right now because when request finish, iput()
+* can lead to freeing of inode. That means it tears down
+* mappings backing DAX memory and sends REMOVEMAPPING message
+* to server and blocks for completion. Currently, waiting
+* in req->end context deadlocks the system as same worker thread
+* can't process REMOVEMAPPING reply it is waiting for.
 */
-   fuse_file_put(ff, ff->fc->destroy, isdir);
+   if (IS_DAX(file_inode(file)) || ff->fc->destroy)
+   sync = true;
+   fuse_file_put(ff, sync, isdir);
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
-- 
2.20.1


[PATCH 20/20] fuse,virtiofs: Add logic to free up a memory range

2020-03-04 Thread Vivek Goyal
Add logic to free up a busy memory range. A freed memory range will be
returned to the free pool. Add a worker which can be started to select
and free some busy memory ranges.

A process can also steal one of its busy dax ranges if a free range is
not available. I will refer to this as direct reclaim.

If a free range is not available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.

For reclaiming a range, as of now we need to hold the following locks in
the specified order.

down_write(&fi->i_mmap_sem);
down_write(&fi->i_dmap_sem);

We look for a free range in the following order (sketched below).

A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free.
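
Roughly, the allocation path ends up looking like this (sketch only;
try_direct_reclaim() is an illustrative name, see
alloc_dax_mapping_reclaim() in the patch below for the real logic):

/* Sketch of the fallback order described above */
static struct fuse_dax_mapping *get_dax_range(struct fuse_conn *fc,
                                              struct inode *inode)
{
        struct fuse_dax_mapping *dmap;

        while (1) {
                /* A. Try to get a free range */
                dmap = alloc_dax_mapping(fc);
                if (dmap)
                        return dmap;

                /* B. If not, try direct reclaim from this inode's busy ranges */
                dmap = try_direct_reclaim(fc, inode);
                if (dmap)
                        return dmap;

                /* C. If not, wait for a memory range to become free */
                wait_event(fc->dax_range_waitq, fc->nr_free_ranges > 0);
        }
}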

Signed-off-by: Vivek Goyal 
Signed-off-by: Liu Bo 
---
 fs/fuse/file.c   | 450 ++-
 fs/fuse/fuse_i.h |  25 +++
 fs/fuse/inode.c  |   5 +
 3 files changed, 473 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8b264fcb9b3c..61ae2ddeef55 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -8,6 +8,7 @@
 
 #include "fuse_i.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -37,6 +38,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, 
gfp_t flags,
return pages;
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+   struct inode *inode, bool fault);
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
 {
@@ -193,6 +196,28 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(>lock);
 }
 
+static void
+__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+   unsigned long free_threshold;
+
+   /* If number of free ranges are below threshold, start reclaim */
+   free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+   (unsigned long)1);
+   if (fc->nr_free_ranges < free_threshold) {
+   pr_debug("fuse: Kicking dax memory reclaim worker. 
nr_free_ranges=0x%ld nr_total_ranges=%ld\n", fc->nr_free_ranges, fc->nr_ranges);
+   queue_delayed_work(system_long_wq, >dax_free_work,
+  msecs_to_jiffies(delay_ms));
+   }
+}
+
+static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+   spin_lock(>lock);
+   __kick_dmap_free_worker(fc, delay_ms);
+   spin_unlock(>lock);
+}
+
 static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 {
struct fuse_dax_mapping *dmap = NULL;
@@ -201,7 +226,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
 
if (fc->nr_free_ranges <= 0) {
spin_unlock(>lock);
-   return NULL;
+   goto out_kick;
}
 
WARN_ON(list_empty(>free_ranges));
@@ -212,6 +237,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
list_del_init(>list);
fc->nr_free_ranges--;
spin_unlock(>lock);
+
+out_kick:
+   kick_dmap_free_worker(fc, 0);
return dmap;
 }
 
@@ -238,6 +266,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
 {
list_add_tail(>list, >free_ranges);
fc->nr_free_ranges++;
+   wake_up(>dax_range_waitq);
 }
 
 static void dmap_add_to_free_pool(struct fuse_conn *fc,
@@ -289,6 +318,12 @@ static int fuse_setup_one_mapping(struct inode *inode, 
loff_t offset,
 
dmap->writable = writable;
if (!upgrade) {
+   /*
+* We don't take a refernce on inode. inode is valid right now
+* and when inode is going away, cleanup logic should first
+* cleanup dmap entries.
+*/
+   dmap->inode = inode;
dmap->start = offset;
dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
/* Protected by fi->i_dmap_sem */
@@ -368,6 +403,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn 
*fc,
 "window_offset=0x%llx length=0x%llx\n", dmap->start,
 dmap->end, dmap->window_offset, dmap->length);
__dmap_remove_busy_list(fc, dmap);
+   dmap->inode = NULL;
dmap->start = dmap->end = 0;
__dmap_add_to_free_pool(fc, dmap);
 }
@@ -386,7 +422,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, 
struct inode *inode,
int err, num = 0;
LIST_HEAD(to_remove);
 
-   pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
+   pr_debug("fuse: %s: inode=0x%px start=0x%llx, end=0x%llx\n", __func__,
+inode, start, end);
 
/*
 * Interval tree s

[PATCH 08/20] fuse,virtiofs: Add a mount option to enable dax

2020-03-04 Thread Vivek Goyal
Add a mount option to allow using dax with virtio_fs.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h|  7 
 fs/fuse/inode.c |  3 ++
 fs/fuse/virtio_fs.c | 82 +
 3 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2cebdf6dcfd8..1fe5065a2902 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -483,10 +483,14 @@ struct fuse_fs_context {
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
+   bool dax:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
 
+   /* DAX device, may be NULL */
+   struct dax_device *dax_dev;
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
 };
@@ -758,6 +762,9 @@ struct fuse_conn {
 
/** List of device instances belonging to this connection */
struct list_head devices;
+
+   /** DAX device, non-NULL if DAX is supported */
+   struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f160a3d47b63..84295fac4ff3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -569,6 +569,8 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->dax_dev)
+   seq_printf(m, ",dax");
return 0;
 }
 
@@ -1185,6 +1187,7 @@ int fuse_fill_super_common(struct super_block *sb, struct 
fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+   fc->dax_dev = ctx->dax_dev;
 
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 3f786a15b0d9..62cdd6817b5b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -65,6 +66,45 @@ struct virtio_fs_forget {
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
+enum {
+   OPT_DAX,
+};
+
+static const struct fs_parameter_spec virtio_fs_parameters[] = {
+   fsparam_flag("dax", OPT_DAX),
+   {}
+};
+
+static int virtio_fs_parse_param(struct fs_context *fc,
+struct fs_parameter *param)
+{
+   struct fs_parse_result result;
+   struct fuse_fs_context *ctx = fc->fs_private;
+   int opt;
+
+   opt = fs_parse(fc, virtio_fs_parameters, param, );
+   if (opt < 0)
+   return opt;
+
+   switch(opt) {
+   case OPT_DAX:
+   ctx->dax = 1;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static void virtio_fs_free_fc(struct fs_context *fc)
+{
+   struct fuse_fs_context *ctx = fc->fs_private;
+
+   if (ctx)
+   kfree(ctx);
+}
+
 static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
 {
struct virtio_fs *fs = vq->vdev->priv;
@@ -1045,23 +1085,27 @@ static const struct fuse_iqueue_ops virtio_fs_fiq_ops = 
{
.release= virtio_fs_fiq_release,
 };
 
-static int virtio_fs_fill_super(struct super_block *sb)
+static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
+{
+   ctx->rootmode = S_IFDIR;
+   ctx->default_permissions = 1;
+   ctx->allow_other = 1;
+   ctx->max_read = UINT_MAX;
+   ctx->blksize = 512;
+   ctx->destroy = true;
+   ctx->no_control = true;
+   ctx->no_force_umount = true;
+}
+
+static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
 {
struct fuse_conn *fc = get_fuse_conn_super(sb);
struct virtio_fs *fs = fc->iq.priv;
+   struct fuse_fs_context *ctx = fsc->fs_private;
unsigned int i;
int err;
-   struct fuse_fs_context ctx = {
-   .rootmode = S_IFDIR,
-   .default_permissions = 1,
-   .allow_other = 1,
-   .max_read = UINT_MAX,
-   .blksize = 512,
-   .destroy = true,
-   .no_control = true,
-   .no_force_umount = true,
-   };
 
+   virtio_fs_ctx_set_defaults(ctx);
mutex_lock(_fs_mutex);
 
/* After holding mutex, make sure virtiofs device is still there.
@@ -1084,8 +1128,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
goto err_free_fuse_d

[PATCH 17/20] fuse,virtiofs: Maintain a list of busy elements

2020-03-04 Thread Vivek Goyal
This list will be used for selecting a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.
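
The threshold check itself is added by a later patch in this series;
roughly (FUSE_DAX_RECLAIM_THRESHOLD is a percentage, see
__kick_dmap_free_worker() in the memory range reclaim patch):

/* Kick the reclaim worker once free ranges fall below the threshold */
static void __kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
{
        unsigned long free_threshold;

        free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD) / 100,
                             (unsigned long)1);
        if (fc->nr_free_ranges < free_threshold)
                queue_delayed_work(system_long_wq, &fc->dax_free_work,
                                   msecs_to_jiffies(delay_ms));
}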

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c   | 22 ++
 fs/fuse/fuse_i.h |  8 
 fs/fuse/inode.c  |  4 
 3 files changed, 34 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 619aff6b5f44..afabeb1acd50 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -215,6 +215,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_del_init(>busy_list);
+   WARN_ON(fc->nr_busy_ranges == 0);
+   fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+   spin_lock(>lock);
+   __dmap_remove_busy_list(fc, dmap);
+   spin_unlock(>lock);
+}
+
 /* This assumes fc->lock is held */
 static void __dmap_add_to_free_pool(struct fuse_conn *fc,
struct fuse_dax_mapping *dmap)
@@ -277,6 +294,10 @@ static int fuse_setup_one_mapping(struct inode *inode, 
loff_t offset,
/* Protected by fi->i_dmap_sem */
fuse_dax_interval_tree_insert(dmap, >dmap_tree);
fi->nr_dmaps++;
+   spin_lock(>lock);
+   list_add_tail(>busy_list, >busy_ranges);
+   fc->nr_busy_ranges++;
+   spin_unlock(>lock);
}
return 0;
 }
@@ -346,6 +367,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn 
*fc,
pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
 "window_offset=0x%llx length=0x%llx\n", dmap->start,
 dmap->end, dmap->window_offset, dmap->length);
+   __dmap_remove_busy_list(fc, dmap);
dmap->start = dmap->end = 0;
__dmap_add_to_free_pool(fc, dmap);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3fea84411401..de213a7e1b0e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -85,6 +85,10 @@ struct fuse_dax_mapping {
/** End Position in file */
__u64 end;
__u64 __subtree_last;
+
+   /* Will connect in fc->busy_ranges to keep track busy memory */
+   struct list_head busy_list;
+
/** Position in DAX window */
u64 window_offset;
 
@@ -814,6 +818,10 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
 
+   /* List of memory ranges which are busy */
+   unsigned long nr_busy_ranges;
+   struct list_head busy_ranges;
+
/*
 * DAX Window Free Ranges
 */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index abc881e6acb0..d4770e7fb7eb 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -616,6 +616,8 @@ static void fuse_free_dax_mem_ranges(struct list_head 
*mem_list)
/* Free All allocated elements */
list_for_each_entry_safe(range, temp, mem_list, list) {
list_del(>list);
+   if (!list_empty(>busy_list))
+   list_del(>busy_list);
kfree(range);
}
 }
@@ -660,6 +662,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 */
range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
range->length = FUSE_DAX_MEM_RANGE_SZ;
+   INIT_LIST_HEAD(>busy_list);
list_add_tail(>list, _ranges);
}
 
@@ -707,6 +710,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(>free_ranges);
+   INIT_LIST_HEAD(>busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.20.1


[PATCH 16/20] fuse,virtiofs: Define dax address space operations

2020-03-04 Thread Vivek Goyal
This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need
to be written back.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ab56396cf661..619aff6b5f44 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2696,6 +2696,16 @@ static int fuse_writepages_fill(struct page *page,
return err;
 }
 
+static int fuse_dax_writepages(struct address_space *mapping,
+   struct writeback_control *wbc)
+{
+
+   struct inode *inode = mapping->host;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   return dax_writeback_mapping_range(mapping, fc->dax_dev, wbc);
+}
+
 static int fuse_writepages(struct address_space *mapping,
   struct writeback_control *wbc)
 {
@@ -4032,6 +4042,13 @@ static const struct address_space_operations 
fuse_file_aops  = {
.write_end  = fuse_write_end,
 };
 
+static const struct address_space_operations fuse_dax_file_aops  = {
+   .writepages = fuse_dax_writepages,
+   .direct_IO  = noop_direct_IO,
+   .set_page_dirty = noop_set_page_dirty,
+   .invalidatepage = noop_invalidatepage,
+};
+
 void fuse_init_file_inode(struct inode *inode)
 {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -4049,5 +4066,6 @@ void fuse_init_file_inode(struct inode *inode)
 
if (fc->dax_dev) {
inode->i_flags |= S_DAX;
+   inode->i_data.a_ops = _dax_file_aops;
}
 }
-- 
2.20.1


[PATCH 00/20] virtiofs: Add DAX support

2020-03-04 Thread Vivek Goyal
nc 458(MiB/s)

virtiofs-cache-none randwrite-psync-multi   213(MiB/s)
virtiofs-cache-none-dax randwrite-psync-multi   1343(MiB/s)

virtiofs-cache-none randwrite-mmap  0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap  663(MiB/s)

virtiofs-cache-none randwrite-mmap-multi0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap-multi1820(MiB/s)

virtiofs-cache-none randwrite-libaio292(MiB/s)
virtiofs-cache-none-dax randwrite-libaio341(MiB/s)

virtiofs-cache-none randwrite-libaio-multi  322(MiB/s)
virtiofs-cache-none-dax randwrite-libaio-multi  1094(MiB/s)

Conclusion
===
- virtio-fs with dax enabled is significantly faster and more memory
  efficient as compared to non-dax operation.

Note:
  Right now the dax window is 64G and the max fio file size is 32G as
  well (4 files of 8G each). That means everything fits into the dax
  window and no reclaim is needed. The dax window reclaim logic is
  slower, and if the file size is bigger than the dax window size,
  performance slows down.

Thanks
Vivek

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (2):
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse,dax: add DAX mmap support

Vivek Goyal (15):
  dax: Modify bdev_dax_pgoff() to handle NULL bdev
  dax: Create a range version of dax_layout_busy_page()
  virtiofs: Provide a helper function for virtqueue initialization
  fuse: Get rid of no_mount_options
  fuse,virtiofs: Add a mount option to enable dax
  fuse,virtiofs: Keep a list of free dax memory ranges
  fuse: implement FUSE_INIT map_alignment field
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse,virtiofs: Define dax address space operations
  fuse,virtiofs: Maintain a list of busy elements
  fuse: Release file in process context
  fuse: Take inode lock for dax inode truncation
  fuse,virtiofs: Add logic to free up a memory range

 drivers/dax/super.c|3 +-
 drivers/virtio/virtio_mmio.c   |   32 +
 drivers/virtio/virtio_pci_modern.c |  107 +++
 fs/dax.c   |   66 +-
 fs/fuse/dir.c  |2 +
 fs/fuse/file.c | 1162 +++-
 fs/fuse/fuse_i.h   |  109 ++-
 fs/fuse/inode.c|  148 +++-
 fs/fuse/virtio_fs.c|  250 +-
 include/linux/dax.h|6 +
 include/linux/virtio_config.h  |   17 +
 include/uapi/linux/fuse.h  |   42 +-
 include/uapi/linux/virtio_fs.h |3 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h|   11 +-
 15 files changed, 1888 insertions(+), 81 deletions(-)

-- 
2.20.1


[PATCH 19/20] fuse: Take inode lock for dax inode truncation

2020-03-04 Thread Vivek Goyal
When a file is opened with O_TRUNC, we need to make sure that any other
DAX operation is not in progress. DAX expects i_size to be stable.

In fuse_iomap_begin() we check i_size at multiple places and we expect
i_size to not change.

Another problem is that if we set up a mapping in fuse_iomap_begin(),
and the file gets truncated and a dax read/write happens, KVM currently
hangs. It tries to fault in a page which does not exist on the host (the
file got truncated). It probably requires fixing in KVM.

So for now, take the inode lock. Once KVM is fixed, we might have to
have a look at it again.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 561428b66101..8b264fcb9b3c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -483,7 +483,7 @@ int fuse_open_common(struct inode *inode, struct file 
*file, bool isdir)
int err;
bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
  fc->atomic_o_trunc &&
- fc->writeback_cache;
+ (fc->writeback_cache || IS_DAX(inode));
 
err = generic_file_open(inode, file);
if (err)
-- 
2.20.1


[PATCH 13/20] fuse, dax: Implement dax read/write operations

2020-03-04 Thread Vivek Goyal
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implementing
read/write.

We make use of an interval tree to keep track of per-inode dax mappings.

Do not use dax for file extending writes; instead just send a WRITE
message to the daemon (like we do for the direct I/O path). This keeps
the write and the i_size change atomic w.r.t. crash.
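
A minimal sketch of the "do not use dax for extending writes" decision
(names here are illustrative; the actual check is made in the dax write
path added by this patch):

/*
 * Sketch: an extending write changes i_size, so it is sent as a regular
 * WRITE request instead of going through DAX, keeping the data and the
 * size update atomic w.r.t. crash.
 */
static bool fuse_dax_write_is_extending(struct kiocb *iocb,
                                        struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);

        return iocb->ki_pos + iov_iter_count(from) > i_size_read(inode);
}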

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
Signed-off-by: Miklos Szeredi 
Signed-off-by: Liu Bo 
Signed-off-by: Peng Tao 
---
 fs/fuse/file.c| 597 +-
 fs/fuse/fuse_i.h  |  23 ++
 fs/fuse/inode.c   |   6 +
 include/uapi/linux/fuse.h |   1 +
 4 files changed, 621 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d67b830fb7a..9effdd3dc6d6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,12 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+
+INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
+ START, LAST, static inline, fuse_dax_interval_tree);
 
 static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
  struct fuse_page_desc **desc)
@@ -187,6 +193,242 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(>lock);
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+   struct fuse_dax_mapping *dmap = NULL;
+
+   spin_lock(&fc->lock);
+
+   if (fc->nr_free_ranges <= 0) {
+   spin_unlock(&fc->lock);
+   return NULL;
+   }
+
+   WARN_ON(list_empty(&fc->free_ranges));
+
+   /* Take a free range */
+   dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+   list);
+   list_del_init(&dmap->list);
+   fc->nr_free_ranges--;
+   spin_unlock(&fc->lock);
+   return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __dmap_add_to_free_pool(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_add_tail(&dmap->list, &fc->free_ranges);
+   fc->nr_free_ranges++;
+}
+
+static void dmap_add_to_free_pool(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   /* Return fuse_dax_mapping to free list */
+   spin_lock(&fc->lock);
+   __dmap_add_to_free_pool(fc, dmap);
+   spin_unlock(&fc->lock);
+}
+
+/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
+static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
+ struct fuse_dax_mapping *dmap, bool writable,
+ bool upgrade)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_setupmapping_in inarg;
+   FUSE_ARGS(args);
+   ssize_t err;
+
+   WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
+   WARN_ON(fc->nr_free_ranges < 0);
+
+   /* Ask fuse daemon to setup mapping */
+   memset(&inarg, 0, sizeof(inarg));
+   inarg.foffset = offset;
+   inarg.fh = -1;
+   inarg.moffset = dmap->window_offset;
+   inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+   inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+   if (writable)
+   inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+   args.opcode = FUSE_SETUPMAPPING;
+   args.nodeid = fi->nodeid;
+   args.in_numargs = 1;
+   args.in_args[0].size = sizeof(inarg);
+   args.in_args[0].value = &inarg;
+   err = fuse_simple_request(fc, &args);
+   if (err < 0) {
+   printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
+__func__, dmap->window_offset, err);
+   return err;
+   }
+
+   pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx writable=%d"
+" err=%zd\n", offset, writable, err);
+
+   dmap->writable = writable;
+   if (!upgrade) {
+   dmap->start = offset;
+   dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+   /* Protected by fi->i_dmap_sem */
+   fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
+   fi->nr_dmaps++;
+   }
+   return 0;
+}
+
+static int
+fuse_send_removemapping(struct inode *inode,
+   struct fuse_removemapping_in *inargp,
+   struct fuse_removemapping_one *remove_one)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   FUSE_ARGS(args);
+
+   args.opcode = FUSE_REMOVEMAPPING;
+   args.nodeid = fi->nodeid;
+   args.in_numargs = 2;
+   args.in_args[0].size = sizeof(*inargp);
+   args.in_args[0].value = i

[PATCH 10/20] fuse,virtiofs: Keep a list of free dax memory ranges

2020-03-04 Thread Vivek Goyal
Divide the dax memory range into fixed-size ranges (2MB for now) and put
them in a list. This list tracks the free ranges. When an inode requires a
free range, we take one from the list and put it in the interval tree of
ranges assigned to that inode.
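
For reference, a consumer of this free list looks roughly like this (a
sketch; the actual allocator is added along with the read/write code later
in the series):

	struct fuse_dax_mapping *dmap = NULL;

	/* Take one free 2MB range off the list, or NULL if the DAX window
	 * is currently exhausted. fc->lock protects the list. */
	spin_lock(&fc->lock);
	if (fc->nr_free_ranges > 0) {
		dmap = list_first_entry(&fc->free_ranges,
					struct fuse_dax_mapping, list);
		list_del_init(&dmap->list);
		fc->nr_free_ranges--;
	}
	spin_unlock(&fc->lock);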

Signed-off-by: Vivek Goyal 
Signed-off-by: Peng Tao 
---
 fs/fuse/fuse_i.h| 22 
 fs/fuse/inode.c | 88 -
 fs/fuse/virtio_fs.c |  2 ++
 3 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1fe5065a2902..edd3136c11f7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,10 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
+/* Default memory range size, 2MB */
+#define FUSE_DAX_MEM_RANGE_SZ  (2*1024*1024)
+#define FUSE_DAX_MEM_RANGE_PAGES   (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -63,6 +67,18 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
 };
 
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+   /* Will connect in fc->free_ranges to keep track of free memory */
+   struct list_head list;
+
+   /** Position in DAX window */
+   u64 window_offset;
+
+   /** Length of mapping, in bytes */
+   loff_t length;
+};
+
 /** FUSE inode */
 struct fuse_inode {
/** Inode data */
@@ -765,6 +781,12 @@ struct fuse_conn {
 
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+
+   /*
+* DAX Window Free Ranges
+*/
+   long nr_free_ranges;
+   struct list_head free_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 84295fac4ff3..0ba092bf0b6d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 MODULE_AUTHOR("Miklos Szeredi ");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -600,6 +602,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
 }
 
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+   struct fuse_dax_mapping *range, *temp;
+
+   /* Free All allocated elements */
+   list_for_each_entry_safe(range, temp, mem_list, list) {
+   list_del(&range->list);
+   kfree(range);
+   }
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+  struct dax_device *dax_dev)
+{
+   long nr_pages, nr_ranges;
+   void *kaddr;
+   pfn_t pfn;
+   struct fuse_dax_mapping *range;
+   LIST_HEAD(mem_ranges);
+   phys_addr_t phys_addr;
+   int ret = 0, id;
+   size_t dax_size = -1;
+   unsigned long i;
+
+   id = dax_read_lock();
+   nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), &kaddr,
+   &pfn);
+   dax_read_unlock(id);
+   if (nr_pages < 0) {
+   pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+   return nr_pages;
+   }
+
+   phys_addr = pfn_t_to_phys(pfn);
+   nr_ranges = nr_pages/FUSE_DAX_MEM_RANGE_PAGES;
+   printk("fuse_dax_mem_range_init(): dax mapped %ld pages. nr_ranges=%ld\n",
+   nr_pages, nr_ranges);
+
+   for (i = 0; i < nr_ranges; i++) {
+   range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+   if (!range) {
+   pr_debug("memory allocation for mem_range failed.\n");
+   ret = -ENOMEM;
+   goto out_err;
+   }
+   /* TODO: This offset only works if virtio-fs driver is not
+* having some memory hidden at the beginning. This needs
+* better handling
+*/
+   range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
+   range->length = FUSE_DAX_MEM_RANGE_SZ;
+   list_add_tail(&range->list, &mem_ranges);
+   }
+
+   list_replace_init(&mem_ranges, &fc->free_ranges);
+   fc->nr_free_ranges = nr_ranges;
+   return 0;
+out_err:
+   /* Free All allocated elements */
+   fuse_free_dax_mem_ranges(&mem_ranges);
+   return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+ struct dax_device *dax_dev)
+{
+   return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
@@ -627,6 +699,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->pid_ns = get_pid_ns(task_active_

[PATCH 11/20] fuse: implement FUSE_INIT map_alignment field

2020-03-04 Thread Vivek Goyal
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field.  Parse this field and
ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are
already 2MB aligned.  Just check the value when the connection is
established.  If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.

The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
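
Concretely, the check added below boils down to this (a sketch with example
numbers; a 4KB-page host reports map_alignment=12, a 64KB-page host 16):

	unsigned long align = 1UL << arg->map_alignment; /* 4096 or 65536 */

	/* The 2MB range size must be a multiple of the daemon's required
	 * alignment, otherwise refuse the connection. */
	if (FUSE_DAX_MEM_RANGE_SZ % align)
		ok = false;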

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   | 19 +--
 include/uapi/linux/fuse.h |  4 +++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index edd3136c11f7..b41275f73e4c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,7 +47,10 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size.  A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
 #define FUSE_DAX_MEM_RANGE_SZ  (2*1024*1024)
 #define FUSE_DAX_MEM_RANGE_PAGES   (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0ba092bf0b6d..36cb9c00bbe5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -961,9 +961,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
 {
struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
struct fuse_init_out *arg = &ia->out;
+   bool ok = true;
 
if (error || arg->major != FUSE_KERNEL_VERSION)
-   fc->conn_error = 1;
+   ok = false;
else {
unsigned long ra_pages;
 
@@ -1026,6 +1027,14 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+   if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+   (FUSE_DAX_MEM_RANGE_SZ % (1ul << arg->map_alignment))) {
+   printk(KERN_ERR "FUSE: map_alignment %u"
+  " incompatible with dax mem range size"
+  " %u\n", arg->map_alignment,
+  FUSE_DAX_MEM_RANGE_SZ);
+   ok = false;
+   }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1041,6 +1050,11 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
}
kfree(ia);
 
+   if (!ok) {
+   fc->conn_init = 0;
+   fc->conn_error = 1;
+   }
+
fuse_set_initialized(fc);
wake_up_all(&fc->blocked_waitq);
 }
@@ -1063,7 +1077,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+   FUSE_MAP_ALIGNMENT;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..5b85819e045f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -313,7 +313,9 @@ struct fuse_file_lock {
  * FUSE_CACHE_SYMLINKS: cache READLINK responses
  * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
  * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
- * FUSE_MAP_ALIGNMENT: map_alignment field is valid
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
+ *foffset and moffset fields in struct
+ *fuse_setupmapping_out and fuse_removemapping_one.
  */
 #define FUSE_ASYNC_READ(1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
-- 
2.20.1


[PATCH 04/20] virtio: Implement get_shm_region for PCI transport

2020-03-04 Thread Vivek Goyal
From: Sebastien Boeuf 

On PCI the shm regions are found using capability entries;
find a region by searching for the capability.
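
From the driver side this op is reached through virtio_get_shm_region(),
added elsewhere in this series. A hypothetical caller, using the virtio-fs
cache (DAX window) region ID as an example (names assumed from the rest of
the series, not from this patch):

	struct virtio_shm_region region;

	/* Returns false if the device does not expose a region with this ID */
	if (virtio_get_shm_region(vdev, &region, (u8)VIRTIO_FS_SHMCAP_ID_CACHE))
		dev_info(&vdev->dev, "cache region addr 0x%llx len 0x%llx\n",
			 region.addr, region.len);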

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: kbuild test robot 
---
 drivers/virtio/virtio_pci_modern.c | 107 +
 include/uapi/linux/virtio_pci.h|  11 ++-
 2 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 7abcc50838b8..52f179411015 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
vring_del_virtqueue(vq);
 }
 
+static int virtio_pci_find_shm_cap(struct pci_dev *dev,
+   u8 required_id,
+   u8 *bar, u64 *offset, u64 *len)
+{
+   int pos;
+
+for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+ pos > 0;
+ pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+   u8 type, cap_len, id;
+u32 tmp32;
+u64 res_offset, res_length;
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cfg_type),
+ &type);
+if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+continue;
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cap_len),
+ &cap_len);
+   if (cap_len != sizeof(struct virtio_pci_cap64)) {
+   printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
+   __func__, pos, cap_len);
+continue;
+}
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ id),
+ &id);
+if (id != required_id)
+continue;
+
+/* Type, and ID match, looks good */
+pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ bar),
+ bar);
+
+/* Read the lower 32bit of length and offset */
+pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
+  &tmp32);
+res_offset = tmp32;
+pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
+  &tmp32);
+res_length = tmp32;
+
+/* and now the top half */
+pci_read_config_dword(dev,
+  pos + offsetof(struct virtio_pci_cap64,
+ offset_hi),
+  &tmp32);
+res_offset |= ((u64)tmp32) << 32;
+pci_read_config_dword(dev,
+  pos + offsetof(struct virtio_pci_cap64,
+ length_hi),
+  &tmp32);
+res_length |= ((u64)tmp32) << 32;
+
+*offset = res_offset;
+*len = res_length;
+
+return pos;
+}
+return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   struct pci_dev *pci_dev = vp_dev->pci_dev;
+   u8 bar;
+   u64 offset, len;
+   phys_addr_t phys_addr;
+   size_t bar_len;
+   int ret;
+
+   if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
+   return false;
+   }
+
+   ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
+   if (ret < 0) {
+   dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
+   __func__);
+   return false;
+   }
+
+   phys_addr = pci_resource_start(pci_dev, bar);
+   bar_len = pci_resource_len(pci_dev, bar);
+
+if (offset + len > bar_len) {
+dev_err(&pci_dev->dev,
+"%s: bar shorter than cap offset+len\n",
+__func__);
+return false;
+}
+
+   region->len = len;
+   region->addr = (u64) phys_addr + offset;
+
+   return true;
+}
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.get= NULL,
.set= NULL,
@@ -457,6 +562,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.bus_name   

[PATCH 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev

2020-03-04 Thread Vivek Goyal
virtiofs does not have a block device. Modify bdev_dax_pgoff() to be
able to handle that.

If there is no bdev, the dax offset is 0 (it can't be a partition block
device starting at an offset in the dax device).
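
A quick example of what this means for callers (a sketch; numbers assume
4K pages):

	pgoff_t pgoff;
	int err;

	/* NULL bdev: the start sector is treated as 0, so sector 8
	 * (4096 bytes into the dax device) maps to page offset 1. */
	err = bdev_dax_pgoff(NULL, 8, PAGE_SIZE, &pgoff);
	/* err == 0, pgoff == 1 */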

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
---
 drivers/dax/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0aa4b6bc5101..c34f21f2f199 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
 {
-   phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+   phys_addr_t phys_off = (start_sect + sector) * 512;
 
if (pgoff)
*pgoff = PHYS_PFN(phys_off);
-- 
2.20.1

