Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE

2022-09-08 Thread Li Zhijian

Daisuke

Great job.

I love this feature. Before starting to review your patches, I tested it with
QEMU (with an fsdax memory backend) migration over RDMA, where it worked for
MLX5 before.

This time, with your ODP patches, it works on RXE, though ibv_advise_mr may
not be ready yet.


Thanks
Zhijian


On 07/09/2022 10:42, Daisuke Matsuda wrote:

Hi everyone,

This patch series implements the On-Demand Paging feature on the SoftRoCE (rxe)
driver, which has so far been available only in the mlx5 driver[1].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
the pages in the MR so that physical addresses never change during RDMA
communication. This requires the MR to fit in physical memory and
inevitably leads to memory pressure. On-Demand Paging (ODP), by contrast,
allows applications to register MRs without pinning pages. Pages are paged
in when the driver needs them and paged out when the OS reclaims them. As a
result, it is possible to register a large MR that does not fit in physical
memory without consuming much physical memory.
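
For reference, an application requests an ODP-enabled MR through the standard
verbs API. A minimal userspace sketch (illustrative only; capability checks and
error handling are omitted, and the helper name is made up):

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Register a large, unpinned (ODP-enabled) MR on a protection domain. */
static struct ibv_mr *register_odp_mr(struct ibv_pd *pd, size_t len)
{
        void *buf = malloc(len);        /* pages are not pinned by the driver */

        if (!buf)
                return NULL;

        /* IBV_ACCESS_ON_DEMAND asks the driver to page the MR on demand. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
}

Whether a device supports ODP for a given transport can be queried beforehand
with ibv_query_device_ex() (the odp_caps field).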

[Why add this feature?]
We, Fujitsu, have contributed to RDMA with a view to using it with
persistent memory. Persistent memory can host a filesystem that allows
applications to read/write files directly without involving the page cache.
This is called FS-DAX (filesystem direct access) mode. There is a problem,
however: data on a DAX-enabled filesystem cannot be duplicated with software
RAID or other hardware methods. Data replication with RDMA, which features
high-speed connections, is the best solution to the problem.

However, there is a known issue that hinders using RDMA with FS-DAX. When
RDMA operations to a file and updates of the file metadata are processed
concurrently on the same node, illegal memory accesses can occur because the
updated metadata is disregarded. This is because RDMA operations do not go
through the page cache but access data directly. There was an effort[2] to
solve this problem, but it was rejected in the end. Though there is no
general solution available, it is possible to work around the problem using
the ODP feature that has been available only in mlx5. ODP enables drivers
to update metadata before processing RDMA operations.

We have enhanced rxe to expedite the use of persistent memory. Our
contributions to rxe include RDMA Atomic write[3] and RDMA Flush[4]. Once
they are merged into rxe along with ODP, an environment will be ready for
developers to create and test software for RDMA with FS-DAX. There is a
library (librpma)[5] being developed for this purpose. This environment
can be used by anybody with an ordinary computer and a normal NIC, without
any special hardware, though it is inferior to hardware implementations in
terms of performance.

[Design considerations]
ODP has been available only in mlx5, but functions and data structures
that can be used commonly are provided in ib_uverbs (infiniband/core). The
interface is heavily dependent on the HMM infrastructure[6], and this
patchset uses them as much as possible. While mlx5 has both Explicit and
Implicit ODP features along with a prefetch feature, this patchset
implements only the Explicit ODP feature.

As an important change, it is necessary to convert the three tasklets
(requester, responder and completer) to workqueues because they must be
able to sleep in order to trigger page faults before accessing MRs. I ran a
test shown in the 2nd patch and found that the change makes the latency
higher while improving the bandwidth. Though it may be possible to create a
new independent workqueue just for page fault execution, that is not a very
sensible solution since the tasklets would then have to busy-wait for its
completion.
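
Conceptually, the conversion replaces each tasklet handler with a work item so
that the handler runs in process context and may sleep. A simplified sketch
with hypothetical names (not the actual patch code):

#include <linux/workqueue.h>

/* Hypothetical stand-in for the per-QP task state. */
struct rxe_task_sketch {
        struct work_struct work;
        void (*func)(void *arg);        /* requester/responder/completer body */
        void *arg;
};

static void rxe_work_handler(struct work_struct *w)
{
        struct rxe_task_sketch *task =
                container_of(w, struct rxe_task_sketch, work);

        /* Unlike a tasklet, this may sleep, e.g. while faulting ODP pages in. */
        task->func(task->arg);
}

/*
 * Setup and scheduling replace tasklet_setup()/tasklet_schedule():
 *      INIT_WORK(&task->work, rxe_work_handler);
 *      queue_work(system_unbound_wq, &task->work);
 */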

If the responder and completer sleep, it becomes more likely that packets are
dropped because of overflow in receive queues. There are multiple queues
involved, but, as SoftRoCE uses UDP, the most important ones are the UDP
buffers. Their sizes can be configured via the net.core.rmem_default and
net.core.rmem_max sysctl parameters. Users should raise these values in case
of packet drops, but a page fault would typically not take long enough to
cause the problem.

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR on its registration. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table. DMA addresses and
PFNs are stored in the driver page table. They are updated on page-in and
page-out, both of which use the common interface in ib_uverbs.

Page-in can occur when the requester, responder or completer accesses an MR
in order to process RDMA operations. If they find that the pages being
accessed are not present in physical memory or that the requisite permissions
are not set on the pages, they trigger a page fault to make the pages present
with proper permissions and, at the same time, update the driver page table.
After confirming the presence of the pages, they execute memory accesses such
as read, write or atomic operations.
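
To illustrate the lookup step, resolving an iova through the driver page table
might look roughly like the following (a hypothetical helper, simplified from
what the patches actually do; it assumes umem_mutex is held so the entry
cannot be invalidated concurrently):

#include <linux/hmm.h>
#include <linux/mm.h>
#include <rdma/ib_umem_odp.h>

static void *odp_iova_to_vaddr(struct ib_umem_odp *umem_odp, u64 iova)
{
        unsigned long idx, pfn, page_off;

        idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
        pfn = umem_odp->pfn_list[idx];

        /* Not mapped (or invalidated): the caller must trigger a page fault. */
        if (!(pfn & HMM_PFN_VALID))
                return NULL;

        page_off = iova & ((1UL << umem_odp->page_shift) - 1);
        return page_address(hmm_pfn_to_page(pfn)) + page_off;
}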


Re: [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP

2022-09-08 Thread matsuda-dais...@fujitsu.com
On Thu, Sep 8, 2022 5:30 PM Leon Romanovsky wrote:
> On Wed, Sep 07, 2022 at 11:43:04AM +0900, Daisuke Matsuda wrote:
> > rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
> > it to load payloads of requesting packets; responder uses it to process
> > Send, Write, and Read operations; completer uses it to copy data from
> > response packets of Read and Atomic operations to a user MR.
> >
> > Allow these operations to be used with ODP by adding a counterpart function
> > rxe_odp_mr_copy(). It is comprised of the following steps:
> >  1. Check the driver page table(umem_odp->dma_list) to see if pages being
> > accessed are present with appropriate permission.
> >  2. If necessary, trigger page fault to map the pages.
> >  3. Convert their user space addresses to kernel logical addresses using
> > PFNs in the driver page table(umem_odp->pfn_list).
> >  4. Execute data copy to/from the pages.
> >
> > umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> > is not changed while it is checked and that mapped pages are not
> > invalidated before data copy completes.
> >
> > Signed-off-by: Daisuke Matsuda 
> > ---
> >  drivers/infiniband/sw/rxe/rxe.c  |  10 ++
> >  drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
> >  drivers/infiniband/sw/rxe/rxe_mr.c   |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_odp.c  | 173 +++
> >  drivers/infiniband/sw/rxe/rxe_resp.c |   6 +-
> >  5 files changed, 190 insertions(+), 3 deletions(-)
> 
> <...>
> 
> > +/* umem mutex is always locked when returning from this function. */
> > +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> > +{
> > +   struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > +   const int max_tries = 3;
> > +   int cnt = 0;
> > +
> > +   int err;
> > +   u64 perm;
> > +   bool need_fault;
> > +
> > +   if (unlikely(length < 1))
> > +   return -EINVAL;
> > +
> > +   perm = ODP_READ_ALLOWED_BIT;
> > +   if (!(flags & RXE_PAGEFAULT_RDONLY))
> > +   perm |= ODP_WRITE_ALLOWED_BIT;
> > +
> > +   mutex_lock(&umem_odp->umem_mutex);
> > +
> > +   /*
> > +* A successful return from rxe_odp_do_pagefault() does not guarantee
> > +* that all pages in the range became present. Recheck the DMA address
> > +* array, allowing max 3 tries for pagefault.
> > +*/
> > +   while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> > +   iova, length, perm))) {
> > +
> > +   if (cnt >= max_tries)
> > +   break;
> > +
> > +   mutex_unlock(&umem_odp->umem_mutex);
> > +
> > +   /* rxe_odp_do_pagefault() locks the umem mutex. */
> 
> Maybe it is correct and safe to release lock in the middle, but it is
> not clear. The whole pattern of taking lock in one function and later
> releasing it in another doesn't look right to me.

When the driver finds the pages are not mapped in rxe_is_pagefault_neccesary(),
it releases the lock to let the kernel execute page invalidation in the
meantime, and takes the lock again to do the page fault in
ib_umem_odp_map_dma_and_lock(). Then, it proceeds to
rxe_is_pagefault_neccesary() again with the lock held.

I admit the usage of the lock is quite confusing.
It is taken before verifying that the target pages are present.
It is released when the target pages are missing and a page fault is required,
or when access to the target pages in an MR is done.

I will move some of the lock taking/releasing operations to rxe_odp_mr_copy()
and rxe_odp_atomic_ops() so that people can understand the situation more
easily. Also, I will rethink the way I explain it in the comments and the
patch description.
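
For illustration, the intended flow is roughly the following (a schematic only,
not the actual patch code; error handling is omitted and the copy helper name
is a placeholder):

        mutex_lock(&umem_odp->umem_mutex);

        while (rxe_is_pagefault_neccesary(umem_odp, iova, length, perm)) {
                mutex_unlock(&umem_odp->umem_mutex);

                /* Faults the pages in via ib_umem_odp_map_dma_and_lock()
                 * and returns with umem_mutex taken again. */
                rxe_odp_do_pagefault(mr, iova, length, flags);
        }

        /* The pages are present and cannot be invalidated while the
         * mutex is held, so it is safe to access them now. */
        do_the_copy(mr, iova, addr, length, dir);       /* placeholder */

        mutex_unlock(&umem_odp->umem_mutex);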

Thank you,
Daisuke Matsuda

> 
> Thanks


RE: [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

2022-09-08 Thread matsuda-dais...@fujitsu.com
On Fri, Sep 9, 2022 1:58 AM Haris Iqbal wrote:
> On Wed, Sep 7, 2022 at 4:45 AM Daisuke Matsuda
>  wrote:
> >
> > Allow applications to register an ODP-enabled MR, in which case the flag
> > IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there are no
> > RDMA operations supported right now. They will be enabled later in the
> > subsequent two patches.
> >
> > rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR here.
> > It syncs process address space from the CPU page table to the driver page
> > table(dma_list/pfn_list in umem_odp) when called with a
> > RXE_PAGEFAULT_SNAPSHOT flag. Additionally, it can be used to trigger page
> > fault when pages being accessed are not present or do not have proper
> > read/write permissions and possibly to prefetch pages in the future.
> >
> > Signed-off-by: Daisuke Matsuda 
> > ---
> >  drivers/infiniband/sw/rxe/rxe.c   |  7 +++
> >  drivers/infiniband/sw/rxe/rxe_loc.h   |  5 ++
> >  drivers/infiniband/sw/rxe/rxe_mr.c|  7 ++-
> >  drivers/infiniband/sw/rxe/rxe_odp.c   | 80 +++
> >  drivers/infiniband/sw/rxe/rxe_resp.c  | 21 +--
> >  drivers/infiniband/sw/rxe/rxe_verbs.c |  8 ++-
> >  drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
> >  7 files changed, 121 insertions(+), 9 deletions(-)
> >

<...>

> > diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> > index cadc8fa64dd0..dd8632e783f6 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> > @@ -535,8 +535,12 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
> > int err;
> > int data_len = payload_size(pkt);
> >
> > -   err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
> > - payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
> > +   if (qp->resp.mr->odp_enabled)
> 
> You cannot use qp->resp.mr here, because for zero byte operations,
> resp.mr is not set in the function check_rkey().
> 
> The code fails for RTRS with the following stack trace,
> 
> [Thu Sep  8 20:12:22 2022] BUG: kernel NULL pointer dereference,
> address: 0158
> [Thu Sep  8 20:12:22 2022] #PF: supervisor read access in kernel mode
> [Thu Sep  8 20:12:22 2022] #PF: error_code(0x) - not-present page
> [Thu Sep  8 20:12:22 2022] PGD 0 P4D 0
> [Thu Sep  8 20:12:22 2022] Oops:  [#1] PREEMPT SMP
> [Thu Sep  8 20:12:22 2022] CPU: 3 PID: 38 Comm: kworker/u8:1 Not
> tainted 6.0.0-rc2-pserver+ #17
> [Thu Sep  8 20:12:22 2022] Hardware name: QEMU Standard PC (i440FX +
> PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> [Thu Sep  8 20:12:22 2022] Workqueue: rxe_resp rxe_do_work [rdma_rxe]
> [Thu Sep  8 20:12:22 2022] RIP: 0010:rxe_responder+0x1910/0x1d90 [rdma_rxe]
> [Thu Sep  8 20:12:22 2022] Code: 06 48 63 88 fc 15 63 c0 0f b6 46 01
> 83 ea 04 c0 e8 04 29 ca 83 e0 03 29 c2 49 8b 87 08 05 00 00 49 03 87
> 00 05 00 00 4c 63 ea <80> bf 58 01 00 00 00 48 8d 14 0e 48 89 c6 4d 89
> ee 44 89 e9 0f 84
> [Thu Sep  8 20:12:22 2022] RSP: 0018:b0358015fd80 EFLAGS: 00010246
> [Thu Sep  8 20:12:22 2022] RAX:  RBX: 9af4839b5e28
> RCX: 0020
> [Thu Sep  8 20:12:22 2022] RDX:  RSI: 9af485094a6a
> RDI: 
> [Thu Sep  8 20:12:22 2022] RBP: 9af488bd7128 R08: 
> R09: 
> [Thu Sep  8 20:12:22 2022] R10: 9af4808eaf7c R11: 0001
> R12: 0008
> [Thu Sep  8 20:12:22 2022] R13:  R14: 9af488bd7380
> R15: 9af488bd7000
> [Thu Sep  8 20:12:22 2022] FS:  ()
> GS:9af5b7d8() knlGS:
> [Thu Sep  8 20:12:22 2022] CS:  0010 DS:  ES:  CR0: 80050033
> [Thu Sep  8 20:12:22 2022] CR2: 0158 CR3: 4a60a000
> CR4: 06e0
> [Thu Sep  8 20:12:22 2022] DR0:  DR1: 
> DR2: 
> [Thu Sep  8 20:12:22 2022] DR3:  DR6: fffe0ff0
> DR7: 0400
> [Thu Sep  8 20:12:22 2022] Call Trace:
> [Thu Sep  8 20:12:22 2022]  
> [Thu Sep  8 20:12:22 2022]  ? newidle_balance+0x2e5/0x400
> [Thu Sep  8 20:12:22 2022]  ? _raw_spin_unlock+0x12/0x30
> [Thu Sep  8 20:12:22 2022]  ? finish_task_switch+0x91/0x2a0
> [Thu Sep  8 20:12:22 2022]  rxe_do_work+0x86/0x110 [rdma_rxe]
> [Thu Sep  8 20:12:22 2022]  process_one_work+0x1dc/0x3a0
> [Thu Sep  8 20:12:22 2022]  worker_thread+0x4a/0x3b0
> [Thu Sep  8 20:12:22 2022]  ? process_one_work+0x3a0/0x3a0
> [Thu Sep  8 20:12:22 2022]  kthread+0xe7/0x110
> [Thu Sep  8 20:12:22 2022]  ? kthread_complete_and_exit+0x20/0x20
> [Thu Sep  8 20:12:22 2022]  ret_from_fork+0x22/0x30
> [Thu Sep  8 20:12:22 2022]  
> [Thu Sep  8 20:12:22 2022] Modules linked in: rnbd_server rtrs_server
> rtrs_core rdma_ucm rdma_cm iw_cm ib_cm crc32_generic rdma_rxe
> ip6_udp_tunnel udp_tunnel ib_uverbs ib_core loop null_blk
> [Thu Sep  8 20:12:22 2022] CR2: 

Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Dan Williams
Andrew Morton wrote:
> On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams  
> wrote:
> 
> > Jonathan Cameron wrote:
> > > On Wed, 7 Sep 2022 18:07:31 -0700
> > > Dan Williams  wrote:
> > > 
> > > > Andrew Morton wrote:
> > > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > > both.
> > > > > 
> > > > > Can we please be very clear in comments and changelogs about exactly
> > > > > what this "flush" does.   With bonus points for being more specific 
> > > > > in the 
> > > > > function naming?
> > > > >   
> > > > 
> > > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > > think this API is purely about invalidate. It just so happens that x86
> > > > has not historically had a global invalidate instruction readily
> > > > available which leads to the overuse of wbinvd.
> > > > 
> > > > It would be nice to make clear that this API is purely about
> > > > invalidating any data cached for a physical address impacted by address
> > > > space management event (secure erase / new region provision). Write-back
> > > > is an unnecessary side-effect.
> > > > 
> > > > So how about:
> > > > 
> > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > > 
> > > Want to indicate it 'might' write back perhaps?
> > > So could be invalidate or clean and invalidate (using arm ARM terms just 
> > > to add
> > > to the confusion ;)
> > > 
> > > Feels like there will be potential race conditions where that matters as 
> > > we might
> > > force stale data to be written back.
> > > 
> > > Perhaps a comment is enough for that. Anyone have the "famous last words" 
> > > feeling?
> > 
> > Is "invalidate" not clear that write-back is optional? Maybe not.
> 
> Yes, I'd say that "invalidate" means "dirty stuff may or may not have
> been written back".  Ditto for invalidate_inode_pages2().
> 
> > Also, I realized that we tried to include the address range to allow for
> > the possibility of flushing by virtual address range, but that
> > overcomplicates the use. I.e. if someone issues a secure erase and the
> > region association is not established, does that mean that the
> > cache invalidation is not needed? It could be the case that someone
> > disables a device, does the secure erase, and then reattaches to the
> > same region. The cache invalidation is needed, but at the time of the
> > secure erase the HPA was unknown.
> > 
> > All this to say that I feel the bikeshedding will need to continue until
> > morale improves.
> > 
> > I notice that the DMA API uses 'sync' to indicate, "make this memory
> > consistent/coherent for the CPU or the device", so how about an API like
> > 
> > memregion_sync_for_cpu(int res_desc)
> > 
> > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
> 
> "sync" is another of my pet peeves ;) In filesystem land, at least. 
> Does it mean "start writeback and return" or does it mean "start
> writeback, wait for its completion then return".  

Ok, no "sync" :).

/**
 * cpu_cache_invalidate_memregion - drop any CPU cached data for
 * memregions described by @res_desc
 * @res_desc: one of the IORES_DESC_* types
 *
 * Perform cache maintenance after a memory event / operation that
 * changes the contents of physical memory in a cache-incoherent manner.
 * For example, memory-device secure erase, or provisioning new CXL
 * regions. This routine may or may not write back any dirty contents
 * while performing the invalidation.
 *
 * Returns 0 on success or negative error code on a failure to perform
 * the cache maintenance.
 */
int cpu_cache_invalidate_memregion(int res_desc)

??
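
For context, a hypothetical caller (illustrative only; it assumes the signature
above and the existing IORES_DESC_PERSISTENT_MEMORY descriptor) might look
like:

#include <linux/ioport.h>       /* IORES_DESC_* */
#include <linux/memregion.h>    /* assumed home of the proposed API */
#include <linux/printk.h>

/* After a secure erase of a persistent-memory device, drop any CPU
 * cached data that still refers to the old contents. */
static int example_post_secure_erase(void)
{
        int rc;

        rc = cpu_cache_invalidate_memregion(IORES_DESC_PERSISTENT_MEMORY);
        if (rc)
                pr_warn("CPU cache invalidation failed: %d\n", rc);

        return rc;
}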



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Andrew Morton
On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams  wrote:

> Jonathan Cameron wrote:
> > On Wed, 7 Sep 2022 18:07:31 -0700
> > Dan Williams  wrote:
> > 
> > > Andrew Morton wrote:
> > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > both.
> > > > 
> > > > Can we please be very clear in comments and changelogs about exactly
> > > > what this "flush" does.   With bonus points for being more specific in 
> > > > the 
> > > > function naming?
> > > >   
> > > 
> > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > think this API is purely about invalidate. It just so happens that x86
> > > has not historically had a global invalidate instruction readily
> > > available which leads to the overuse of wbinvd.
> > > 
> > > It would be nice to make clear that this API is purely about
> > > invalidating any data cached for a physical address impacted by address
> > > space management event (secure erase / new region provision). Write-back
> > > is an unnecessary side-effect.
> > > 
> > > So how about:
> > > 
> > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > 
> > Want to indicate it 'might' write back perhaps?
> > So could be invalidate or clean and invalidate (using arm ARM terms just to 
> > add
> > to the confusion ;)
> > 
> > Feels like there will be potential race conditions where that matters as we 
> > might
> > force stale data to be written back.
> > 
> > Perhaps a comment is enough for that. Anyone have the "famous last words" 
> > feeling?
> 
> Is "invalidate" not clear that write-back is optional? Maybe not.

Yes, I'd say that "invalidate" means "dirty stuff may or may not have
been written back".  Ditto for invalidate_inode_pages2().

> Also, I realized that we tried to include the address range to allow for
> the possibility of flushing by virtual address range, but that
> overcomplicates the use. I.e. if someone issues a secure erase and the
> region association is not established, does that mean that the
> cache invalidation is not needed? It could be the case that someone
> disables a device, does the secure erase, and then reattaches to the
> same region. The cache invalidation is needed, but at the time of the
> secure erase the HPA was unknown.
> 
> All this to say that I feel the bikeshedding will need to continue until
> morale improves.
> 
> I notice that the DMA API uses 'sync' to indicate, "make this memory
> consistent/coherent for the CPU or the device", so how about an API like
> 
> memregion_sync_for_cpu(int res_desc)
> 
> ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.

"sync" is another of my pet peeves ;) In filesystem land, at least. 
Does it mean "start writeback and return" or does it mean "start
writeback, wait for its completion then return".  




Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Dan Williams
Jonathan Cameron wrote:
> On Wed, 7 Sep 2022 18:07:31 -0700
> Dan Williams  wrote:
> 
> > Andrew Morton wrote:
> > > I really dislike the term "flush".  Sometimes it means writeback,
> > > sometimes it means invalidate.  Perhaps at other times it means
> > > both.
> > > 
> > > Can we please be very clear in comments and changelogs about exactly
> > > what this "flush" does.   With bonus points for being more specific in 
> > > the 
> > > function naming?
> > >   
> > 
> > That's a good point, "flush" has been cargo-culted along in Linux's
> > cache management APIs to mean write-back-and-invalidate. In this case I
> > think this API is purely about invalidate. It just so happens that x86
> > has not historically had a global invalidate instruction readily
> > available which leads to the overuse of wbinvd.
> > 
> > It would be nice to make clear that this API is purely about
> > invalidating any data cached for a physical address impacted by address
> > space management event (secure erase / new region provision). Write-back
> > is an unnecessary side-effect.
> > 
> > So how about:
> > 
> > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> 
> Want to indicate it 'might' write back perhaps?
> So could be invalidate or clean and invalidate (using arm ARM terms just to 
> add
> to the confusion ;)
> 
> Feels like there will be potential race conditions where that matters as we 
> might
> force stale data to be written back.
> 
> Perhaps a comment is enough for that. Anyone have the "famous last words" 
> feeling?

Is "invalidate" not clear that write-back is optional? Maybe not.

Also, I realized that we tried to include the address range to allow for
the possibility of flushing by virtual address range, but that
overcomplicates the use. I.e. if someone issues a secure erase and the
region association is not established, does that mean that the
cache invalidation is not needed? It could be the case that someone
disables a device, does the secure erase, and then reattaches to the
same region. The cache invalidation is needed, but at the time of the
secure erase the HPA was unknown.

All this to say that I feel the bikeshedding will need to continue until
morale improves.

I notice that the DMA API uses 'sync' to indicate, "make this memory
consistent/coherent for the CPU or the device", so how about an API like

memregion_sync_for_cpu(int res_desc)

...where the @res_desc would be IORES_DESC_CXL for all CXL and
IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.



Re: [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

2022-09-08 Thread Haris Iqbal
On Wed, Sep 7, 2022 at 4:45 AM Daisuke Matsuda
 wrote:
>
> Allow applications to register an ODP-enabled MR, in which case the flag
> IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there are no
> RDMA operations supported right now. They will be enabled later in the
> subsequent two patches.
>
> rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR here.
> It syncs process address space from the CPU page table to the driver page
> table(dma_list/pfn_list in umem_odp) when called with a
> RXE_PAGEFAULT_SNAPSHOT flag. Additionally, it can be used to trigger page
> fault when pages being accessed are not present or do not have proper
> read/write permissions and possibly to prefetch pages in the future.
>
> Signed-off-by: Daisuke Matsuda 
> ---
>  drivers/infiniband/sw/rxe/rxe.c   |  7 +++
>  drivers/infiniband/sw/rxe/rxe_loc.h   |  5 ++
>  drivers/infiniband/sw/rxe/rxe_mr.c|  7 ++-
>  drivers/infiniband/sw/rxe/rxe_odp.c   | 80 +++
>  drivers/infiniband/sw/rxe/rxe_resp.c  | 21 +--
>  drivers/infiniband/sw/rxe/rxe_verbs.c |  8 ++-
>  drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
>  7 files changed, 121 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 51daac5c4feb..0719f451253c 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
> rxe->ndev->dev_addr);
>
> rxe->max_ucontext   = RXE_MAX_UCONTEXT;
> +
> +   if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
> +   rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
> +
> +   /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
> +   rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
> +   }
>  }
>
>  /* initialize port attributes */
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 0f8cb9e38cc9..03b4078b90a3 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -64,6 +64,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
>
>  /* rxe_mr.c */
>  u8 rxe_get_next_key(u32 last_key);
> +void rxe_mr_init(int access, struct rxe_mr *mr);
>  void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr);
>  int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>  int access, struct rxe_mr *mr);
> @@ -188,4 +189,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
> return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
>  }
>
> +/* rxe_odp.c */
> +int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
> +  int access_flags, struct rxe_mr *mr);
> +
>  #endif /* RXE_LOC_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 814116ec4778..0ae72a4516be 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -48,7 +48,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
> | IB_ACCESS_REMOTE_WRITE\
> | IB_ACCESS_REMOTE_ATOMIC)
>
> -static void rxe_mr_init(int access, struct rxe_mr *mr)
> +void rxe_mr_init(int access, struct rxe_mr *mr)
>  {
> u32 lkey = mr->elem.index << 8 | rxe_get_next_key(-1);
> u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0;
> @@ -438,7 +438,10 @@ int copy_data(
> if (bytes > 0) {
> iova = sge->addr + offset;
>
> -   err = rxe_mr_copy(mr, iova, addr, bytes, dir);
> +   if (mr->odp_enabled)
> +   err = -EOPNOTSUPP;
> +   else
> +   err = rxe_mr_copy(mr, iova, addr, bytes, dir);
> if (err)
> goto err2;
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index 0f702787a66e..1f6930ba714c 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -5,6 +5,8 @@
>
>  #include 
>
> +#include "rxe.h"
> +
>  bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
>  const struct mmu_notifier_range *range,
>  unsigned long cur_seq)
> @@ -32,3 +34,81 @@ bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
>  const struct mmu_interval_notifier_ops rxe_mn_ops = {
> .invalidate = rxe_ib_invalidate_range,
>  };
> +
> +#define RXE_PAGEFAULT_RDONLY BIT(1)
> +#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
> > +static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
> +{
> +   int np;
> +  

Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-08 Thread Shiyang Ruan




On 2022/8/4 8:51, Darrick J. Wong wrote:

On Wed, Aug 03, 2022 at 06:47:24AM +, ruansy.f...@fujitsu.com wrote:


...



BTW, since these patches (dax + THIS + pmem-unbind) are
waiting to be merged, is it time to think about "removing the
experimental tag" again?  :)


It's probably time to take up that question again.

Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
and at least one of those dm layers no longer allows fsdax pass-through,
so XFS silently turned mount -o dax into -o dax=never. :(


Hi Darrick,

I tried generic/470 but it didn't run:
 [not run] Cannot use thin-pool devices on DAX capable block devices.

Did you modify _require_dm_target() in common/rc?  I added thin-pool so that
it does not check the dax capability:

   case $target in
   stripe|linear|log-writes|thin-pool)  # add thin-pool here
   ;;

then the case finally ran and it silently turned off dax as you said.

Are the steps for reproduction correct? If so, I will continue to
investigate this problem.


Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
mention that.  I suspect that the removal of dm support for pmem is
going to force us to completely redesign this test.  I can't really
think of how, though, since there's no good way that I know of to gain a
point-in-time snapshot of a pmem device.


Hi Darrick,

  > removal of dm support for pmem
I think here we are talking about xfstests removing the support, not the
kernel?

I found some xfstests commits:
fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool

So, this case has never been able to run since the second commit?  (I didn't
notice the 'not run' result.  I thought it was expected not to run.)

And according to the first commit, the restriction was added because
some of dm devices don't support dax.  So my understanding is: we should
redesign the case to make it work, and first, we should add dax
support for dm devices in the kernel.


dm devices used to have fsdax support; I think Christoph is actively
removing (or already has removed) all that support.


In addition, is there any other testcase has the same problem?  so that
we can deal with them together.


The last I checked, there aren't any that require MAP_SYNC or pmem aside
from g/470 and the three poison notification tests that you sent a few
days ago.

--D



Hi Darrick, Brian

I made a little investigation on generic/470.

This case was able to run before thin-pool was introduced[1], but since
then, it has been 'Failed'/'Not Run' because thin-pool does not support
DAX.  I have checked the history of thin-pool; it has never supported DAX,
and it's not that someone removed the fsdax support.  So, I think it's not
correct to bypass the requirement conditions by adding 'thin-pool' to
_require_dm_target().


As far as I know, thin-pool was introduced (to provide discard zeroing) to
prevent out-of-order replay of dm-log-writes.  Should we solve the
'out-of-order replay' issue instead of avoiding it with thin-pool? @Brian


Besides, since it's not an fsdax problem, I think there is nothing that needs
to be fixed in fsdax.  I'd like to help get it solved, but I'm still
wondering if we could go back to the original topic ("Remove Experimental
Tag") first? :)



[1] fc5870da485aec0f9196a0f2bed32f73f6b2c664 generic/470: use thin 
volume for dmlogwrites target device



--
Thanks,
Ruan.





Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE

2022-09-08 Thread matsuda-dais...@fujitsu.com
On Thu, Sep 8, 2022 5:41 PM Zhu Yanjun wrote:
> > [How to test ODP?]
> > There are only a few resources available for testing. pyverbs testcases in
> > rdma-core and perftest[7] are recommendable ones. Note that you may have to
> > build perftest from upstream since older versions do not handle ODP
> > capabilities correctly.
> 
> ibv_rc_pingpong can also test the odp feature.

This may be the easiest way to try the feature.

> 
> Please add rxe odp test cases in rdma-core.

This RFC implementation is functionally a subset of mlx5.
So, basically, we can use the existing one (tests/test_odp.py).
I'll add new cases when I make them for additional testing.

Thanks,
Daisuke Matsuda

> 
> Thanks a lot.
> Zhu Yanjun






Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE

2022-09-08 Thread Zhu Yanjun
On Wed, Sep 7, 2022 at 10:44 AM Daisuke Matsuda
 wrote:
>
> Hi everyone,
>
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.
>
> [Overview]
> When applications register a memory region(MR), RDMA drivers normally pin
> pages in the MR so that physical addresses are never changed during RDMA
> communication. This requires the MR to fit in physical memory and
> inevitably leads to memory pressure. On the other hand, On-Demand Paging
> (ODP) allows applications to register MRs without pinning pages. They are
> paged-in when the driver requires and paged-out when the OS reclaims. As a
> result, it is possible to register a large MR that does not fit in physical
> memory without taking up so much physical memory.
>
> [Why to add this feature?]
> We, Fujitsu, have contributed to RDMA with a view to using it with
> persistent memory. Persistent memory can host a filesystem that allows
> applications to read/write files directly without involving page cache.
> This is called FS-DAX(filesystem direct access) mode. There is a problem
> that data on DAX-enabled filesystem cannot be duplicated with software RAID
> or other hardware methods. Data replication with RDMA, which features
> high-speed connections, is the best solution for the problem.
>
> However, there is a known issue that hinders using RDMA with FS-DAX. When
> RDMA operations to a file and update of the file metadata are processed
> concurrently on the same node, illegal memory accesses can be executed,
> disregarding the updated metadata. This is because RDMA operations do not
> go through page cache but access data directly. There was an effort[2] to
> solve this problem, but it was rejected in the end. Though there is no
> general solution available, it is possible to work around the problem using
> the ODP feature that has been available only in mlx5. ODP enables drivers
> to update metadata before processing RDMA operations.
>
> We have enhanced the rxe to expedite the usage of persistent memory. Our
> contribution to rxe includes RDMA Atomic write[3] and RDMA Flush[4]. With
> them being merged to rxe along with ODP, an environment will be ready for
> developers to create and test software for RDMA with FS-DAX. There is a
> library(librpma)[5] being developed for this purpose. This environment
> can be used by anybody without any special hardware but an ordinary
> computer with a normal NIC though it is inferior to hardware
> implementations in terms of performance.
>
> [Design considerations]
> ODP has been available only in mlx5, but functions and data structures
> that can be used commonly are provided in ib_uverbs(infiniband/core). The
> interface is heavily dependent on HMM infrastructure[6], and this patchset
> use them as much as possible. While mlx5 has both Explicit and Implicit ODP
> features along with prefetch feature, this patchset implements the Explicit
> ODP feature only.
>
> As an important change, it is necessary to convert triple tasklets
> (requester, responder and completer) to workqueues because they must be
> able to sleep in order to trigger page fault before accessing MRs. I did a
> test shown in the 2nd patch and found that the change makes the latency
> higher while improving the bandwidth. Though it may be possible to create a
> new independent workqueue for page fault execution, it is a not very
> sensible solution since the tasklets have to busy-wait its completion in
> that case.
>
> If responder and completer sleep, it becomes more likely that packet drop
> occurs because of overflow in receiver queue. There are multiple queues
> involved, but, as SoftRoCE uses UDP, the most important one would be the
> UDP buffers. The size can be configured in net.core.rmem_default and
> net.core.rmem_max sysconfig parameters. Users should change these values in
> case of packet drop, but page fault would be typically not so long as to
> cause the problem.
>
> [How does ODP work?]
> "struct ib_umem_odp" is used to manage pages. It is created for each
> ODP-enabled MR on its registration. This struct holds a pair of arrays
> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> PFNs are stored in the driver page table. They are updated on page-in and
> page-out, both of which use the common interface in ib_uverbs.
>
> Page-in can occur when requester, responder or completer access an MR in
> order to process RDMA operations. If they find that the pages being
> accessed are not present on physical memory or requisite permissions are
> not set on the pages, they provoke page fault to make pages present with
> proper permissions and at the same time update the driver page table. After
> confirming the presence of the pages, they execute memory access such as
> read, write or atomic operations.
>
> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> update of a file that is being used as an MR). When 

Re: [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP

2022-09-08 Thread Leon Romanovsky
On Wed, Sep 07, 2022 at 11:43:04AM +0900, Daisuke Matsuda wrote:
> rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
> it to load payloads of requesting packets; responder uses it to process
> Send, Write, and Read operations; completer uses it to copy data from
> response packets of Read and Atomic operations to a user MR.
> 
> Allow these operations to be used with ODP by adding a counterpart function
> rxe_odp_mr_copy(). It is comprised of the following steps:
>  1. Check the driver page table(umem_odp->dma_list) to see if pages being
> accessed are present with appropriate permission.
>  2. If necessary, trigger page fault to map the pages.
>  3. Convert their user space addresses to kernel logical addresses using
> PFNs in the driver page table(umem_odp->pfn_list).
>  4. Execute data copy to/from the pages.
> 
> umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> is not changed while it is checked and that mapped pages are not
> invalidated before data copy completes.
> 
> Signed-off-by: Daisuke Matsuda 
> ---
>  drivers/infiniband/sw/rxe/rxe.c  |  10 ++
>  drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
>  drivers/infiniband/sw/rxe/rxe_mr.c   |   2 +-
>  drivers/infiniband/sw/rxe/rxe_odp.c  | 173 +++
>  drivers/infiniband/sw/rxe/rxe_resp.c |   6 +-
>  5 files changed, 190 insertions(+), 3 deletions(-)

<...>

> +/* umem mutex is always locked when returning from this function. */
> +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> +{
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> + const int max_tries = 3;
> + int cnt = 0;
> +
> + int err;
> + u64 perm;
> + bool need_fault;
> +
> + if (unlikely(length < 1))
> + return -EINVAL;
> +
> + perm = ODP_READ_ALLOWED_BIT;
> + if (!(flags & RXE_PAGEFAULT_RDONLY))
> + perm |= ODP_WRITE_ALLOWED_BIT;
> +
> + mutex_lock(&umem_odp->umem_mutex);
> +
> + /*
> +  * A successful return from rxe_odp_do_pagefault() does not guarantee
> +  * that all pages in the range became present. Recheck the DMA address
> +  * array, allowing max 3 tries for pagefault.
> +  */
> + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> + iova, length, perm))) {
> +
> + if (cnt >= max_tries)
> + break;
> +
> + mutex_unlock(&umem_odp->umem_mutex);
> +
> + /* rxe_odp_do_pagefault() locks the umem mutex. */

Maybe it is correct and safe to release lock in the middle, but it is
not clear. The whole pattern of taking lock in one function and later
releasing it in another doesn't look right to me.

Thanks



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Dan Williams
Borislav Petkov wrote:
> On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> > To be clear nfit stuff and CXL does run in guests, but they do not
> > support secure-erase in a guest.
> > 
> > However, the QEMU CXL enabling is building the ability to do *guest
> > physical* address space management, but in that case the driver can be
> > paravirtualized to realize that it is not managing host-physical address
> > space and does not need to flush caches. That will need some indicator
> > to differentiate virtual CXL memory expanders from assigned devices.
> 
> Sounds to me like that check should be improved later to ask
> whether the kernel is managing host-physical address space, maybe
> arch_flush_memregion() should check whether the address it is supposed
> to flush is host-physical and exit early if not...

Even though I raised the possibility of guest passthrough of a CXL
memory expander, I do not think it could work in practice without it
being a gigantic security nightmare. So it is probably safe to just do
the hypervisor check and assume that there's no such thing as guest
management of host physical address space.