Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Jason Gunthorpe
On Fri, Aug 27, 2021 at 09:42:21AM -0700, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian  wrote:
> >
> >
> > on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > >> I looked over the changelog of hmm_vma_handle_pte(), and found that
> > >> before 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> > >> HMM_PFN_SPECIAL handling"), hmm_vma_handle_pte() did not check
> > >> pte_special(pte) if pte_devmap(pte) was true.
> > >>
> > >> When we reach
> > >> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > >> the PTE is already present and its flags already satisfy the request
> > >> flags.
> > >>
> > >>
> > >> My question is:
> > >> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > >> pte_devmap(pte) and pte_special(pte) can both be true in the fsdax use
> > >> case, right?
> > > How? What code creates that?
> > >
> > > I see:
> > >
> > > insert_pfn():
> > >   /* Ok, finally just insert the thing.. */
> > >   if (pfn_t_devmap(pfn))
> > >   entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > >   else
> > >   entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > >
> > > So what code path ends up setting both bits?
> >
> >   pte_mkdevmap() sets both _PAGE_SPECIAL | _PAGE_DEVMAP
> >
> >   395 static inline pte_t pte_mkdevmap(pte_t pte)
> >   396 {
> >   397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> >   398 }
> 
> I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
> think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.

This is my feeling too, but every arch does it, so hmm should check
it, at least for now, as a stable fix.

devmap has a struct page, so it should be refcounted inside the VMA,
and that is the main thing that _PAGE_SPECIAL disables, AFAICR.

The only places where pte_special() is used that I wonder about for
devmap have to do with CPU cache maintenance.

vm_normal_page(), hmm_vma_handle_pte(), and gup_pte_range() all look OK
to drop the special bit.

Jason
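
For reference, a minimal sketch of the ordering this implies (a
hypothetical helper, assuming an arch such as x86 where pte_mkdevmap()
sets both bits; this is not the actual hmm fix):

/*
 * Sketch only: on x86, pte_mkdevmap() sets _PAGE_SPECIAL | _PAGE_DEVMAP
 * together, so a walker must test pte_devmap() before pte_special().
 */
#include <linux/mm.h>
#include <linux/pgtable.h>

static bool pte_is_special_only(pte_t pte)
{
	if (pte_devmap(pte))
		return false;	/* backed by a struct page; treat as normal */
	return pte_special(pte) && !is_zero_pfn(pte_pfn(pte));
}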



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Dan Williams
On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian  wrote:
>
>
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> >> I looked over the changelog of hmm_vma_handle_pte(), and found that
> >> before 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> >> HMM_PFN_SPECIAL handling"), hmm_vma_handle_pte() did not check
> >> pte_special(pte) if pte_devmap(pte) was true.
> >>
> >> When we reach
> >> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> >> the PTE is already present and its flags already satisfy the request
> >> flags.
> >>
> >>
> >> My question is:
> >> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> >> pte_devmap(pte) and pte_special(pte) can both be true in the fsdax use
> >> case, right?
> > How? What code creates that?
> >
> > I see:
> >
> > insert_pfn():
> >   /* Ok, finally just insert the thing.. */
> >   if (pfn_t_devmap(pfn))
> >   entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> >   else
> >   entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> >
> > So what code path ends up setting both bits?
>
>   pte_mkdevmap() sets both _PAGE_SPECIAL | _PAGE_DEVMAP
>
>   395 static inline pte_t pte_mkdevmap(pte_t pte)
>   396 {
>   397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
>   398 }

I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Li, Zhijian



on 2021/8/27 21:16, Jason Gunthorpe wrote:

On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:

Yes, can you send a proper patch and include the mm mailing list?


Of course, my pleasure.

Thanks






Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Jason Gunthorpe
On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:
> 
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > > I looked over the changelog of hmm_vma_handle_pte(), and found that
> > > before 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> > > HMM_PFN_SPECIAL handling"), hmm_vma_handle_pte() did not check
> > > pte_special(pte) if pte_devmap(pte) was true.
> > > 
> > > When we reach
> > > "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > > the PTE is already present and its flags already satisfy the request
> > > flags.
> > > 
> > > 
> > > My question is:
> > > Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > > pte_devmap(pte) and pte_special(pte) can both be true in the fsdax use
> > > case, right?
> > How? What code creates that?
> > 
> > I see:
> > 
> > insert_pfn():
> > /* Ok, finally just insert the thing.. */
> > if (pfn_t_devmap(pfn))
> > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > else
> > entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > 
> > So what code path ends up setting both bits?
> 
>  pte_mkdevmap() sets both _PAGE_SPECIAL | _PAGE_DEVMAP
> 
>  395 static inline pte_t pte_mkdevmap(pte_t pte)
>  396 {
>  397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
>  398 }
> 
> below is a calltrace example
> 
> [  400.728559] Call Trace:
> [  400.731595] dump_stack+0x6d/0x8b
> [  400.735536] insert_pfn+0x16c/0x180
> [  400.739596] __vm_insert_mixed+0x84/0xc0
> [  400.744144] dax_iomap_pte_fault+0x845/0x870
> [  400.749089] ext4_dax_huge_fault+0x171/0x1e0
> [  400.754096] __do_fault+0x31/0xe0
> [  400.758090]  ? pmd_devmap_trans_unstable+0x37/0x90
> [  400.763541] handle_mm_fault+0x11b1/0x1680
> [  400.768260] exc_page_fault+0x2f4/0x570
> [  400.772788]  ? asm_exc_page_fault+0x8/0x30
> [  400.777539]  asm_exc_page_fault+0x1e/0x30
> 
> 
> So is my previous change reasonable?

Yes, can you send a proper patch and include the mm mailing list?

Jason



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Li, Zhijian



on 2021/8/27 20:10, Jason Gunthorpe wrote:

On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:

I looked over the changelog of hmm_vma_handle_pte(), and found that before
4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL
handling"), hmm_vma_handle_pte() did not check pte_special(pte) if
pte_devmap(pte) was true.

When we reach
"if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
the PTE is already present and its flags already satisfy the request flags.


My question is:
Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
pte_devmap(pte) and pte_special(pte) can both be true in the fsdax use case,
right?

How? What code creates that?

I see:

insert_pfn():
/* Ok, finally just insert the thing.. */
if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
else
entry = pte_mkspecial(pfn_t_pte(pfn, prot));

So what code path ends up setting both bits?


 pte_mkdevmap() sets both _PAGE_SPECIAL | _PAGE_DEVMAP

 395 static inline pte_t pte_mkdevmap(pte_t pte)
 396 {
 397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
 398 }

below is a calltrace example

[  400.728559] Call Trace:
[  400.731595] dump_stack+0x6d/0x8b
[  400.735536] insert_pfn+0x16c/0x180
[  400.739596] __vm_insert_mixed+0x84/0xc0
[  400.744144] dax_iomap_pte_fault+0x845/0x870
[  400.749089] ext4_dax_huge_fault+0x171/0x1e0
[  400.754096] __do_fault+0x31/0xe0
[  400.758090]  ? pmd_devmap_trans_unstable+0x37/0x90
[  400.763541] handle_mm_fault+0x11b1/0x1680
[  400.768260] exc_page_fault+0x2f4/0x570
[  400.772788]  ? asm_exc_page_fault+0x8/0x30
[  400.777539]  asm_exc_page_fault+0x1e/0x30


So is my previous change reasonable?

Thanks

Zhijian
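
The trace above corresponds to the first write fault on an mmap()ed fsdax
file. A minimal userspace sketch that drives this path (the mount point,
file name, and size are illustrative; it assumes an ext4 filesystem mounted
with -o dax on a pmem device):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Assumes an ext4 fsdax mount; the path is illustrative. */
	int fd = open("/mnt/pmem0/testfile", O_CREAT | O_RDWR, 0600);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;	/* first touch faults: dax_iomap_pte_fault() -> insert_pfn() */
	munmap(p, 4096);
	close(fd);
	return 0;
}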






Re: [RFC v2 1/2] virtio-pmem: Async virtio-pmem flush

2021-08-27 Thread Pankaj Gupta
> > > > > > Implement asynchronous flush for virtio pmem using work queue
> > > > > > to solve the preflush ordering issue. Also, coalesce the flush
> > > > > > requests when a flush is already in progress.
> > > > > >
> > > > > > Signed-off-by: Pankaj Gupta 
> > > > > > ---
> > > > > >  drivers/nvdimm/nd_virtio.c   | 72 
> > > > > >  drivers/nvdimm/virtio_pmem.c | 10 -
> > > > > >  drivers/nvdimm/virtio_pmem.h | 14 +++
> > > > > >  3 files changed, 79 insertions(+), 17 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > > > > index 10351d5b49fa..61b655b583be 100644
> > > > > > --- a/drivers/nvdimm/nd_virtio.c
> > > > > > +++ b/drivers/nvdimm/nd_virtio.c
> > > > > > @@ -97,29 +97,69 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
> > > > > > return err;
> > > > > >  };
> > > > > >
> > > > > > +static void submit_async_flush(struct work_struct *ws);
> > > > > > +
> > > > > >  /* The asynchronous flush callback function */
> > > > > >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > > > > >  {
> > > > > > -   /*
> > > > > > -* Create child bio for asynchronous flush and chain with
> > > > > > -* parent bio. Otherwise directly call nd_region flush.
> > > > > > +   /* queue asynchronous flush and coalesce the flush requests */
> > > > > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > > > > +   struct virtio_pmem *vpmem  = vdev->priv;
> > > > > > +   ktime_t req_start = ktime_get_boottime();
> > > > > > +
> > > > > > +   spin_lock_irq(&vpmem->lock);
> > > > > > +   /* flush requests wait until ongoing flush completes,
> > > > > > +* hence coalescing all the pending requests.
> > > > > >  */
> > > > > > -   if (bio && bio->bi_iter.bi_sector != -1) {
> > > > > > -   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > > > > -
> > > > > > -   if (!child)
> > > > > > -   return -ENOMEM;
> > > > > > -   bio_copy_dev(child, bio);
> > > > > > -   child->bi_opf = REQ_PREFLUSH;
> > > > > > -   child->bi_iter.bi_sector = -1;
> > > > > > -   bio_chain(child, bio);
> > > > > > -   submit_bio(child);
> > > > > > -   return 0;
> > > > > > +   wait_event_lock_irq(vpmem->sb_wait,
> > > > > > +   !vpmem->flush_bio ||
> > > > > > +   ktime_before(req_start, vpmem->prev_flush_start),
> > > > > > +   vpmem->lock);
> > > > > > +   /* new request after previous flush is completed */
> > > > > > +   if (ktime_after(req_start, vpmem->prev_flush_start)) {
> > > > > > +   WARN_ON(vpmem->flush_bio);
> > > > > > +   vpmem->flush_bio = bio;
> > > > > > +   bio = NULL;
> > > > > > +   }
> > > > >
> > > > > Why the dance with ->prev_flush_start vs just calling queue_work()
> > > > > again. queue_work() is naturally coalescing in that if the last work
> > > > > request has not started execution another queue attempt will be
> > > > > dropped.
> > > >
> > > > How will the parent flush request know when the corresponding
> > > > flush has completed?
> > >
> > > The eventual bio_endio() is what signals upper layers that the flush
> > > completed...
> > >
> > >
> > > Hold on... it's been so long that I forgot that you are copying
> > > md_flush_request() here. It would help immensely if that was mentioned
> > > in the changelog and, at a minimum, a comment added in the code that
> > > this was copied from md. In fact it would be extra helpful if you
> >
> > My bad. I only mentioned this in the cover letter.
>
> Yeah, sorry about that. Having come back to this after so long I just
> decided to jump straight into the patches, but even if I had read that
> cover I still would have given the feedback that md_flush_request()
> heritage should also be noted with a comment in the code.

Sure.

Thanks,
Pankaj
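
For background on the queue_work() point above: queueing a work item that
is still pending is a no-op, so concurrent requests naturally collapse into
one execution. A minimal sketch of that behavior (hypothetical module code,
not from the patch; it illustrates the coalescing only, not the completion
signaling the patch also needs):

#include <linux/workqueue.h>

/* One execution services every request queued while the work was pending. */
static void flush_fn(struct work_struct *work)
{
	/* ... issue the actual flush here ... */
}
static DECLARE_WORK(flush_work, flush_fn);

static void request_flush(void)
{
	/*
	 * queue_work() returns false and does nothing if flush_work is
	 * already pending, so racing callers coalesce onto the same
	 * upcoming execution.
	 */
	queue_work(system_wq, &flush_work);
}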



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Jason Gunthorpe
On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> I looked over the changelog of hmm_vma_handle_pte(), and found that before
> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL
> handling"), hmm_vma_handle_pte() did not check pte_special(pte) if
> pte_devmap(pte) was true.
> 
> When we reach
> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> the PTE is already present and its flags already satisfy the request flags.
> 
> 
> My question is:
> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> pte_devmap(pte) and pte_special(pte) can both be true in the fsdax use case,
> right?

How? What code creates that?

I see:

insert_pfn():
/* Ok, finally just insert the thing.. */
if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
else
entry = pte_mkspecial(pfn_t_pte(pfn, prot));

So what code path ends up setting both bits?

Jason



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread lizhij...@fujitsu.com
Hi all

I have verified that the changes below solve this problem:

diff --git a/mm/hmm.c b/mm/hmm.c
index 24f9ff95f3ae..2c9a3e3eefce 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -294,7 +294,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	 * Since each architecture defines a struct page for the zero page, just
 	 * fall through and treat it like a normal page.
 	 */
-	if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+	if (!pte_devmap(pte) && pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
 		if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
 			pte_unmap(ptep);
 			return -EFAULT;


I looked over the changelog of hmm_vma_handle_pte(), and found that before
4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL
handling"), hmm_vma_handle_pte() did not check pte_special(pte) if
pte_devmap(pte) was true.

When we reach
"if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
the PTE is already present and its flags already satisfy the request flags.


My question is:
Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
pte_devmap(pte) and pte_special(pte) can both be true in the fsdax use case,
right?

If so, what's the expected code path for the fsdax case?


commit 4055062749229101365e5f9e87cb1c5a93e292f8
Author: Jason Gunthorpe 
Date:   Thu Mar 5 14:27:20 2020 -0400

     mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling

     Currently if a special PTE is encountered hmm_range_fault() immediately
     returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
     uses).

     EFAULT should only be returned after testing with hmm_pte_need_fault().

     Also pte_devmap() and pte_special() are exclusive, and there is no need to
     check IS_ENABLED, pte_special() is stubbed out to return false on
     unsupported architectures.

     Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
     Reviewed-by: Ralph Campbell 
     Reviewed-by: Christoph Hellwig 
     Signed-off-by: Jason Gunthorpe 

diff --git a/mm/hmm.c b/mm/hmm.c
index 3a03fcf..9c82ea9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -339,16 +339,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			pte_unmap(ptep);
 			return -EBUSY;
 		}
-	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
-		if (!is_zero_pfn(pte_pfn(pte))) {
+	}
+
+	/*
+	 * Since each architecture defines a struct page for the zero page, just
+	 * fall through and treat it like a normal page.
+	 */
+	if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, &fault,
+				   &write_fault);
+		if (fault || write_fault) {
 			pte_unmap(ptep);
-			*pfn = range->values[HMM_PFN_SPECIAL];
 			return -EFAULT;
 		}
-		/*
-		 * Since each architecture defines a struct page for the zero
-		 * page, just fall through and treat it like a normal page.
-		 */
+		*pfn = range->values[HMM_PFN_SPECIAL];
+		return 0;
 	}
 
 	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;


Thanks
Zhijian



On 06/08/2021 11:20, Li, Zhijian wrote:
> Hi Jason
>
> thank you for your advice.
>
> CCing nvdimm
>
> both ext4 and xfs are impacted.
>
> Thanks
>
>
> at 2021/8/6 9:45, Jason Gunthorpe wrote:
>> On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote:
>>> convert to text and send again
>>>
>>> 2021/8/4 15:55, Li, Zhijian wrote:
 Hey all:

 Recently, I reported an issue to rpma
 (https://github.com/pmem/rpma/issues/1142), where we found that the
 native rpma + fsdax example fails on recent kernels.

 Below is the bisect log

 [lizhijian@yl linux]$ git bisect log
 git bisect start
 # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
 git bisect good bbf5c979011a099af5dc76498918ed7df445635b
 # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
 git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
 # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add 
 failure injection to usercopy functions
 git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
 # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that 
 fast_commit is available via /sys/fs/ext4/feature/...
 git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
 # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 
 'wireless-drivers-next-2020-10-02' of