Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
On Fri, Aug 27, 2021 at 09:42:21AM -0700, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian wrote:
> >
> > on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > >> i looked over the change-log of hmm_vma_handle_pte(), and found that before
> > >> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> > >> HMM_PFN_SPECIAL handling")
> > >>
> > >> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte)
> > >> is true.
> > >>
> > >> when we reached
> > >> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > >> the pte have already presented and its pte's flag already fulfilled the
> > >> request flags.
> > >>
> > >> My question is that
> > >> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > >> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user
> > >> case, right ?
> > >
> > > How? what code creates that?
> > >
> > > I see:
> > >
> > > insert_pfn():
> > > 	/* Ok, finally just insert the thing.. */
> > > 	if (pfn_t_devmap(pfn))
> > > 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > > 	else
> > > 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > >
> > > So what code path ends up setting both bits?
> >
> > pte_mkdevmap() will set both _PAGE_SPECIAL | _PAGE_DEVMAP
> >
> > static inline pte_t pte_mkdevmap(pte_t pte)
> > {
> > 	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> > }
>
> I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
> think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.

This is my feeling too, but every arch does it, so hmm should check it,
at least for now as a stable fix.

devmap has a struct page, so it should be refcounted inside the VMA, and
that is the main thing that PAGE_SPECIAL disabled, AFAICR..

The only places using pte_special() that I wonder whether are OK for
devmap have to do with CPU cache maintenance.

vm_normal_page(), hmm_vma_handle_pte() and gup_pte_range() all look OK
to drop the special bit.

Jason
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian wrote:
>
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > > i looked over the change-log of hmm_vma_handle_pte(), and found that before
> > > 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> > > HMM_PFN_SPECIAL handling")
> > >
> > > hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is
> > > true.
> > >
> > > when we reached
> > > "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > > the pte have already presented and its pte's flag already fulfilled the
> > > request flags.
> > >
> > > My question is that
> > > Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > > pte_devmap(pte) and pte_special(pte) could be both true in fsdax user
> > > case, right ?
> >
> > How? what code creates that?
> >
> > I see:
> >
> > insert_pfn():
> > 	/* Ok, finally just insert the thing.. */
> > 	if (pfn_t_devmap(pfn))
> > 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > 	else
> > 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> >
> > So what code path ends up setting both bits?
>
> pte_mkdevmap() will set both _PAGE_SPECIAL | _PAGE_DEVMAP
>
> static inline pte_t pte_mkdevmap(pte_t pte)
> {
> 	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> }

I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
on 2021/8/27 21:16, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:
>
> Yes, can you send a proper patch and include the mm mailing list?

Of course, my pleasure.

Thanks
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:
>
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > > i looked over the change-log of hmm_vma_handle_pte(), and found that before
> > > 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> > > HMM_PFN_SPECIAL handling")
> > >
> > > hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte)
> > > is true.
> > >
> > > when we reached
> > > "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > > the pte have already presented and its pte's flag already fulfilled the
> > > request flags.
> > >
> > > My question is that
> > > Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > > pte_devmap(pte) and pte_special(pte) could be both true in fsdax user
> > > case, right ?
> >
> > How? what code creates that?
> >
> > I see:
> >
> > insert_pfn():
> > 	/* Ok, finally just insert the thing.. */
> > 	if (pfn_t_devmap(pfn))
> > 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > 	else
> > 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> >
> > So what code path ends up setting both bits?
>
> pte_mkdevmap() will set both _PAGE_SPECIAL | _PAGE_DEVMAP
>
> static inline pte_t pte_mkdevmap(pte_t pte)
> {
> 	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> }
>
> below is a calltrace example
>
> [ 400.728559] Call Trace:
> [ 400.731595] dump_stack+0x6d/0x8b
> [ 400.735536] insert_pfn+0x16c/0x180
> [ 400.739596] __vm_insert_mixed+0x84/0xc0
> [ 400.744144] dax_iomap_pte_fault+0x845/0x870
> [ 400.749089] ext4_dax_huge_fault+0x171/0x1e0
> [ 400.754096] __do_fault+0x31/0xe0
> [ 400.758090] ? pmd_devmap_trans_unstable+0x37/0x90
> [ 400.763541] handle_mm_fault+0x11b1/0x1680
> [ 400.768260] exc_page_fault+0x2f4/0x570
> [ 400.772788] ? asm_exc_page_fault+0x8/0x30
> [ 400.777539] asm_exc_page_fault+0x1e/0x30
>
> So is my previous change reasonable ?

Yes, can you send a proper patch and include the mm mailing list?

Jason
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
on 2021/8/27 20:10, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > i looked over the change-log of hmm_vma_handle_pte(), and found that before
> > 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in
> > HMM_PFN_SPECIAL handling")
> >
> > hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is
> > true.
> >
> > when we reached
> > "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > the pte have already presented and its pte's flag already fulfilled the
> > request flags.
> >
> > My question is that
> > Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > pte_devmap(pte) and pte_special(pte) could be both true in fsdax user
> > case, right ?
>
> How? what code creates that?
>
> I see:
>
> insert_pfn():
> 	/* Ok, finally just insert the thing.. */
> 	if (pfn_t_devmap(pfn))
> 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> 	else
> 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
>
> So what code path ends up setting both bits?

pte_mkdevmap() will set both _PAGE_SPECIAL | _PAGE_DEVMAP

static inline pte_t pte_mkdevmap(pte_t pte)
{
	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
}

below is a calltrace example

[ 400.728559] Call Trace:
[ 400.731595] dump_stack+0x6d/0x8b
[ 400.735536] insert_pfn+0x16c/0x180
[ 400.739596] __vm_insert_mixed+0x84/0xc0
[ 400.744144] dax_iomap_pte_fault+0x845/0x870
[ 400.749089] ext4_dax_huge_fault+0x171/0x1e0
[ 400.754096] __do_fault+0x31/0xe0
[ 400.758090] ? pmd_devmap_trans_unstable+0x37/0x90
[ 400.763541] handle_mm_fault+0x11b1/0x1680
[ 400.768260] exc_page_fault+0x2f4/0x570
[ 400.772788] ? asm_exc_page_fault+0x8/0x30
[ 400.777539] asm_exc_page_fault+0x1e/0x30

So is my previous change reasonable ?

Thanks
Zhijian
Re: [RFC v2 1/2] virtio-pmem: Async virtio-pmem flush
> > > > > > Implement asynchronous flush for virtio pmem using work queue
> > > > > > to solve the preflush ordering issue. Also, coalesce the flush
> > > > > > requests when a flush is already in process.
> > > > > >
> > > > > > Signed-off-by: Pankaj Gupta
> > > > > > ---
> > > > > >  drivers/nvdimm/nd_virtio.c   | 72
> > > > > >  drivers/nvdimm/virtio_pmem.c | 10 -
> > > > > >  drivers/nvdimm/virtio_pmem.h | 14 +++
> > > > > >  3 files changed, 79 insertions(+), 17 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > > > > index 10351d5b49fa..61b655b583be 100644
> > > > > > --- a/drivers/nvdimm/nd_virtio.c
> > > > > > +++ b/drivers/nvdimm/nd_virtio.c
> > > > > > @@ -97,29 +97,69 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
> > > > > >  	return err;
> > > > > >  };
> > > > > >
> > > > > > +static void submit_async_flush(struct work_struct *ws);
> > > > > > +
> > > > > >  /* The asynchronous flush callback function */
> > > > > >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > > > > >  {
> > > > > > -	/*
> > > > > > -	 * Create child bio for asynchronous flush and chain with
> > > > > > -	 * parent bio. Otherwise directly call nd_region flush.
> > > > > > +	/* queue asynchronous flush and coalesce the flush requests */
> > > > > > +	struct virtio_device *vdev = nd_region->provider_data;
> > > > > > +	struct virtio_pmem *vpmem = vdev->priv;
> > > > > > +	ktime_t req_start = ktime_get_boottime();
> > > > > > +
> > > > > > +	spin_lock_irq(&vpmem->lock);
> > > > > > +	/* flush requests wait until ongoing flush completes,
> > > > > > +	 * hence coalescing all the pending requests.
> > > > > >  	 */
> > > > > > -	if (bio && bio->bi_iter.bi_sector != -1) {
> > > > > > -		struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > > > > -
> > > > > > -		if (!child)
> > > > > > -			return -ENOMEM;
> > > > > > -		bio_copy_dev(child, bio);
> > > > > > -		child->bi_opf = REQ_PREFLUSH;
> > > > > > -		child->bi_iter.bi_sector = -1;
> > > > > > -		bio_chain(child, bio);
> > > > > > -		submit_bio(child);
> > > > > > -		return 0;
> > > > > > +	wait_event_lock_irq(vpmem->sb_wait,
> > > > > > +			    !vpmem->flush_bio ||
> > > > > > +			    ktime_before(req_start, vpmem->prev_flush_start),
> > > > > > +			    vpmem->lock);
> > > > > > +	/* new request after previous flush is completed */
> > > > > > +	if (ktime_after(req_start, vpmem->prev_flush_start)) {
> > > > > > +		WARN_ON(vpmem->flush_bio);
> > > > > > +		vpmem->flush_bio = bio;
> > > > > > +		bio = NULL;
> > > > > > +	}
> > > > >
> > > > > Why the dance with ->prev_flush_start vs just calling queue_work()
> > > > > again. queue_work() is naturally coalescing in that if the last work
> > > > > request has not started execution another queue attempt will be
> > > > > dropped.
> > > >
> > > > How parent flush request will know when corresponding flush is
> > > > completed?
> > >
> > > The eventual bio_endio() is what signals upper layers that the flush
> > > completed...
> > >
> > > Hold on... it's been so long that I forgot that you are copying
> > > md_flush_request() here. It would help immensely if that was mentioned
> > > in the changelog and at a minimum have a comment in the code that this
> > > was copied from md. In fact it would be extra helpful if you
> >
> > My bad. I only mentioned this in the cover letter.
>
> Yeah, sorry about that. Having come back to this after so long I just
> decided to jump straight into the patches, but even if I had read that
> cover I still would have given the feedback that md_flush_request()
> heritage should also be noted with a comment in the code.

Sure.

Thanks,
Pankaj
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> i looked over the change-log of hmm_vma_handle_pte(), and found that before
> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL
> handling")
>
> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is
> true.
>
> when we reached
> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> the pte have already presented and its pte's flag already fulfilled the
> request flags.
>
> My question is that
> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case,
> right ?

How? what code creates that?

I see:

insert_pfn():
	/* Ok, finally just insert the thing.. */
	if (pfn_t_devmap(pfn))
		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
	else
		entry = pte_mkspecial(pfn_t_pte(pfn, prot));

So what code path ends up setting both bits?

Jason
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
Hi all

I have verified that below changes can solve this problem

diff --git a/mm/hmm.c b/mm/hmm.c
index 24f9ff95f3ae..2c9a3e3eefce 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -294,7 +294,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	 * Since each architecture defines a struct page for the zero page, just
 	 * fall through and treat it like a normal page.
 	 */
-	if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+	if (!pte_devmap(pte) && pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
 		if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
 			pte_unmap(ptep);
 			return -EFAULT;

i looked over the change-log of hmm_vma_handle_pte(), and found that before
4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL
handling")

hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is
true.

when we reached
"if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
the pte have already presented and its pte's flag already fulfilled the
request flags.

My question is that
Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case,
right ? if so, what's the expected code path for fsdax case ?

commit 4055062749229101365e5f9e87cb1c5a93e292f8
Author: Jason Gunthorpe
Date:   Thu Mar 5 14:27:20 2020 -0400

    mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling

    Currently if a special PTE is encountered hmm_range_fault() immediately
    returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
    uses). EFAULT should only be returned after testing with
    hmm_pte_need_fault().

    Also pte_devmap() and pte_special() are exclusive, and there is no need
    to check IS_ENABLED, pte_special() is stubbed out to return false on
    unsupported architectures.

    Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

diff --git a/mm/hmm.c b/mm/hmm.c
index 3a03fcf..9c82ea9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -339,16 +339,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			pte_unmap(ptep);
 			return -EBUSY;
 		}
-	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
-		if (!is_zero_pfn(pte_pfn(pte))) {
+	}
+
+	/*
+	 * Since each architecture defines a struct page for the zero page, just
+	 * fall through and treat it like a normal page.
+	 */
+	if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, &fault,
+				   &write_fault);
+		if (fault || write_fault) {
 			pte_unmap(ptep);
-			*pfn = range->values[HMM_PFN_SPECIAL];
 			return -EFAULT;
 		}
-		/*
-		 * Since each architecture defines a struct page for the zero
-		 * page, just fall through and treat it like a normal page.
-		 */
+		*pfn = range->values[HMM_PFN_SPECIAL];
+		return 0;
 	}

 	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;

Thanks
Zhijian

On 06/08/2021 11:20, Li, Zhijian wrote:
> Hi Jason
>
> thank you for your advice.
>
> CCing nvdimm
>
> both ext4 and xfs are impacted.
>
> Thanks
>
> at 2021/8/6 9:45, Jason Gunthorpe wrote:
> > On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote:
> > > convert to text and send again
> > >
> > > 2021/8/4 15:55, Li, Zhijian wrote:
> > > > Hey all:
> > > >
> > > > Recently, i reported a issue to rpma https://github.com/pmem/rpma/issues/1142
> > > > where we found that the native rpma + fsdax example failed in recent kernel.
> > > >
> > > > Below is the bisect log
> > > >
> > > > [lizhijian@yl linux]$ git bisect log
> > > > git bisect start
> > > > # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
> > > > git bisect good bbf5c979011a099af5dc76498918ed7df445635b
> > > > # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
> > > > git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
> > > > # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add failure injection to usercopy functions
> > > > git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
> > > > # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...
> > > > git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
> > > > # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 'wireless-drivers-next-2020-10-02' of
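The bisect log quoted above comes from the standard git bisect workflow. A self-contained sketch of that workflow on a throwaway repository (hypothetical commits, not the kernel tree) looks like this:

```shell
# Build a throwaway repo with one good and one bad commit, then drive
# git bisect the same way the log above was produced.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -q -m "known good state"
good=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -q -m "commit introducing the regression"
bad=$(git rev-parse HEAD)

git bisect start
git bisect bad  "$bad"    # mark the broken revision
git bisect good "$good"   # mark the last known-good revision
log=$(git bisect log)     # same format as the log quoted above
echo "$log"
git bisect reset
```

With a real regression, each `git bisect good`/`git bisect bad` verdict comes from rebuilding and re-running the failing test at the commit git checks out, until git names the first bad commit.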