Re: [net-next v1 13/16] tcp: RX path for devmem TCP

2023-12-08 Thread Mina Almasry
On Fri, Dec 8, 2023 at 9:55 AM David Ahern  wrote:
>
> On 12/7/23 5:52 PM, Mina Almasry wrote:
> > In tcp_recvmsg_locked(), detect if the skb being received by the user
> > is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> > flag - pass it to tcp_recvmsg_devmem() for custom handling.
> >
> > tcp_recvmsg_devmem() copies any data in the skb header to the linear
> > buffer, and returns a cmsg to the user indicating the number of bytes
> > returned in the linear buffer.
> >
> > tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> > and returns to the user a cmsg_devmem indicating the location of the
> > data in the dmabuf device memory. cmsg_devmem contains this information:
> >
> > 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> > 2. the size of the frag. 'frag_size'.
> > 3. an opaque token 'frag_token' to return to the kernel when the buffer
> > is to be released.
> >
> > The pages awaiting freeing are stored in the newly added
> > sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> > This reference is dropped once the userspace indicates that it is
> > done reading this page.  All pages are released when the socket is
> > destroyed.
> >
> > Signed-off-by: Willem de Bruijn 
> > Signed-off-by: Kaiyuan Zhang 
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >
> > Changes in v1:
> > - Added dmabuf_id to dmabuf_cmsg (David/Stan).
> > - Devmem -> dmabuf (David).
> > - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
> > - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).
> >
> > RFC v3:
> > - Fixed issue with put_cmsg() failing silently.
> >
>
> What happens if a retransmitted packet is received or an rx window is
> closed and a probe is received where the kernel drops the skb - is the
> iov reference(s) in the skb returned to the pool by the stack and ready
> for use again?

When an skb is dropped, skb_frag_unref() is called on the frags, which
calls napi_pp_put_page(), drops the references, and the iov is
recycled, yes.

-- 
Thanks,
Mina
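
The recycle-on-drop behaviour described in this reply can be illustrated with a toy userspace model. This is only a sketch of reference-counted buffer recycling in general: every demo_* name is hypothetical, and none of it is the kernel's page_pool, page_pool_iov, skb_frag_unref() or napi_pp_put_page() code.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_POOL_SLOTS 4

/* Toy stand-in for a pool of device-memory buffers (hypothetical). */
struct demo_slot {
	int refs;     /* outstanding references, e.g. one per skb frag */
	bool in_pool; /* true when the slot can be handed out again */
};

struct demo_pool {
	struct demo_slot slots[DEMO_POOL_SLOTS];
};

/* Mark every slot as free. */
static void demo_pool_init(struct demo_pool *p)
{
	for (size_t i = 0; i < DEMO_POOL_SLOTS; i++) {
		p->slots[i].refs = 0;
		p->slots[i].in_pool = true;
	}
}

/* Hand out the first free slot with one reference, or -1 if none is free. */
static int demo_pool_alloc(struct demo_pool *p)
{
	for (int i = 0; i < DEMO_POOL_SLOTS; i++) {
		if (p->slots[i].in_pool) {
			p->slots[i].in_pool = false;
			p->slots[i].refs = 1;
			return i;
		}
	}
	return -1;
}

/* Take an extra reference on a slot that is already in use. */
static void demo_slot_get(struct demo_pool *p, int idx)
{
	p->slots[idx].refs++;
}

/* Drop one reference. The last put returns the slot to the pool.
 * Returns true if the slot was recycled by this call. */
static bool demo_slot_put(struct demo_pool *p, int idx)
{
	if (--p->slots[idx].refs > 0)
		return false;
	p->slots[idx].in_pool = true;
	return true;
}
```

In this model, a dropped packet corresponds to calling demo_slot_put() on each slot it references; once the count reaches zero the slot is immediately available to demo_pool_alloc() again.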



Re: [net-next v1 13/16] tcp: RX path for devmem TCP

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> In tcp_recvmsg_locked(), detect if the skb being received by the user
> is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> flag - pass it to tcp_recvmsg_devmem() for custom handling.
> 
> tcp_recvmsg_devmem() copies any data in the skb header to the linear
> buffer, and returns a cmsg to the user indicating the number of bytes
> returned in the linear buffer.
> 
> tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> and returns to the user a cmsg_devmem indicating the location of the
> data in the dmabuf device memory. cmsg_devmem contains this information:
> 
> 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> 2. the size of the frag. 'frag_size'.
> 3. an opaque token 'frag_token' to return to the kernel when the buffer
> is to be released.
> 
> The pages awaiting freeing are stored in the newly added
> sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> This reference is dropped once the userspace indicates that it is
> done reading this page.  All pages are released when the socket is
> destroyed.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> Changes in v1:
> - Added dmabuf_id to dmabuf_cmsg (David/Stan).
> - Devmem -> dmabuf (David).
> - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
> - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).
> 
> RFC v3:
> - Fixed issue with put_cmsg() failing silently.
> 

What happens if a retransmitted packet is received or an rx window is
closed and a probe is received where the kernel drops the skb - is the
iov reference(s) in the skb returned to the pool by the stack and ready
for use again?



Re: [net-next v1 13/16] tcp: RX path for devmem TCP

2023-12-08 Thread kernel test robot
Hi Mina,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:
https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/net-page_pool-factor-out-releasing-DMA-from-releasing-the-page/20231208-085531
base:   net-next/main
patch link:
https://lore.kernel.org/r/20231208005250.2910004-14-almasrymina%40google.com
patch subject: [net-next v1 13/16] tcp: RX path for devmem TCP
config: alpha-defconfig 
(https://download.01.org/0day-ci/archive/20231208/202312082353.lfkttexo-...@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231208/202312082353.lfkttexo-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312082353.lfkttexo-...@intel.com/

All errors (new ones prefixed by >>):

   net/ipv4/tcp.c: In function 'tcp_recvmsg_dmabuf':
>> net/ipv4/tcp.c:2348:57: error: 'SO_DEVMEM_LINEAR' undeclared (first use in this function)
    2348 |                         err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR,
         |                                                         ^~~~~~~~~~~~~~~~
   net/ipv4/tcp.c:2348:57: note: each undeclared identifier is reported only once for each function it appears in
>> net/ipv4/tcp.c:2411:48: error: 'SO_DEVMEM_DMABUF' undeclared (first use in this function)
    2411 |                                                SO_DEVMEM_DMABUF,
         |                                                ^~~~~~~~~~~~~~~~


vim +/SO_DEVMEM_LINEAR +2348 net/ipv4/tcp.c

  2306  
  2307  /* On error, returns the -errno. On success, returns number of bytes sent to the
  2308   * user. May not consume all of @remaining_len.
  2309   */
  2310  static int tcp_recvmsg_dmabuf(const struct sock *sk, const struct sk_buff *skb,
  2311                                unsigned int offset, struct msghdr *msg,
  2312                                int remaining_len)
  2313  {
  2314          struct dmabuf_cmsg dmabuf_cmsg = { 0 };
  2315          unsigned int start;
  2316          int i, copy, n;
  2317          int sent = 0;
  2318          int err = 0;
  2319  
  2320          do {
  2321                  start = skb_headlen(skb);
  2322  
  2323                  if (!skb->dmabuf) {
  2324                          err = -ENODEV;
  2325                          goto out;
  2326                  }
  2327  
  2328                  /* Copy header. */
  2329                  copy = start - offset;
  2330                  if (copy > 0) {
  2331                          copy = min(copy, remaining_len);
  2332  
  2333                          n = copy_to_iter(skb->data + offset, copy,
  2334                                           &msg->msg_iter);
  2335                          if (n != copy) {
  2336                                  err = -EFAULT;
  2337                                  goto out;
  2338                          }
  2339  
  2340                          offset += copy;
  2341                          remaining_len -= copy;
  2342  
  2343                          /* First a dmabuf_cmsg for # bytes copied to user
  2344                           * buffer.
  2345                           */
  2346                          memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg));
  2347                          dmabuf_cmsg.frag_size = copy;
> 2348                          err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR,
  2349                                         sizeof(dmabuf_cmsg), &dmabuf_cmsg);
  2350                          if (err || msg->msg_flags & MSG_CTRUNC) {
  2351                                  msg->msg_flags &= ~MSG_CTRUNC;
  2352                                  if (!err)
  2353                                          err = -ETOOSMALL;
  2354                                  goto out;
  2355                          }
  2356  
  2357                          sent += copy;
  2358  
  2359                          if (remaining_len == 0)
  2360                                  goto out;
  2361                  }
  2362  
  2363                  /* after that, send information of dmabuf pages through a
  2364                   * sequence of cmsg
  2365                   */
  2366                  for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
  2367                          skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
  2368                          struct page_pool_iov *ppiov;
  2369                          u64 frag_offset;
  2370                          u32 user_token;
  2371                          int end;
  2372  
  2373                          /* skb->dmabuf should indicate that ALL the frags in
  2374                           * this skb are dmabuf page_pool_iovs. We're check

[net-next v1 13/16] tcp: RX path for devmem TCP

2023-12-07 Thread Mina Almasry
In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvmsg_dmabuf() for custom handling.

tcp_recvmsg_dmabuf() copies any data in the skb header to the linear
buffer, and returns a cmsg to the user indicating the number of bytes
returned in the linear buffer.

tcp_recvmsg_dmabuf() then loops over the inaccessible devmem skb frags,
and returns to the user a dmabuf_cmsg indicating the location of the
data in the dmabuf device memory. dmabuf_cmsg contains this information:

1. the offset into the dmabuf where the payload starts. 'frag_offset'.
2. the size of the frag. 'frag_size'.
3. an opaque token 'frag_token' to return to the kernel when the buffer
is to be released.

The pages awaiting freeing are stored in the newly added
sk->sk_user_pages, and each page passed to userspace is get_page()'d.
This reference is dropped once the userspace indicates that it is
done reading this page.  All pages are released when the socket is
destroyed.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Kaiyuan Zhang 
Signed-off-by: Mina Almasry 

---

Changes in v1:
- Added dmabuf_id to dmabuf_cmsg (David/Stan).
- Devmem -> dmabuf (David).
- Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
- Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).

RFC v3:
- Fixed issue with put_cmsg() failing silently.
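
For readers following the receive flow described above, here is a minimal userspace sketch of walking the control messages that one recvmsg() call would return. It rests on stated assumptions: the DEMO_* constants copy the SO_DEVMEM_LINEAR/SO_DEVMEM_DMABUF values from this patch under local names so they cannot clash with system headers, and struct demo_dmabuf_cmsg mirrors only the three fields listed in the description (the struct in the patch may carry more, such as the dmabuf_id mentioned in the changelog). The demo_* names are hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Local copies of the cmsg types this patch adds (assumed values). */
#define DEMO_SCM_DEVMEM_LINEAR 98
#define DEMO_SCM_DEVMEM_DMABUF 99

/* Local mirror of the fields described in the commit message. */
struct demo_dmabuf_cmsg {
	uint64_t frag_offset; /* offset into the dmabuf where the payload starts */
	uint32_t frag_size;   /* size of the frag in bytes */
	uint32_t frag_token;  /* opaque token to return when releasing the frag */
};

struct demo_totals {
	size_t linear_bytes; /* bytes copied into the ordinary iovec */
	size_t dmabuf_bytes; /* bytes left in device memory */
	unsigned int frags;  /* number of dmabuf frags reported */
};

/* Walk the control messages of one received msghdr and tally what the
 * kernel reported: linear bytes versus frags left in the dmabuf. */
static struct demo_totals demo_parse_cmsgs(struct msghdr *msg)
{
	struct demo_totals t = { 0, 0, 0 };
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		struct demo_dmabuf_cmsg dc;

		/* Skip anything that is not a full-sized SOL_SOCKET cmsg. */
		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_len < CMSG_LEN(sizeof(dc)))
			continue;

		/* Copy out of the control buffer to avoid alignment issues. */
		memcpy(&dc, CMSG_DATA(cm), sizeof(dc));

		if (cm->cmsg_type == DEMO_SCM_DEVMEM_LINEAR) {
			t.linear_bytes += dc.frag_size;
		} else if (cm->cmsg_type == DEMO_SCM_DEVMEM_DMABUF) {
			t.dmabuf_bytes += dc.frag_size;
			t.frags++;
		}
	}
	return t;
}
```

A caller would presumably invoke this after recvmsg(fd, &msg, MSG_SOCK_DEVMEM) succeeds, then read each payload at frag_offset inside the bound dmabuf and keep frag_token until the frag is released.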

---
 include/linux/socket.h            |   1 +
 include/net/page_pool/helpers.h   |   9 ++
 include/net/sock.h                |   2 +
 include/uapi/asm-generic/socket.h |   5 +
 include/uapi/linux/uio.h          |  10 ++
 net/ipv4/tcp.c                    | 190 +-
 net/ipv4/tcp_ipv4.c               |   8 ++
 7 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index cfcb7e2c3813..fe2b9e2081bb 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -326,6 +326,7 @@ struct ucred {
  * plain text and require encryption
  */
 
+#define MSG_SOCK_DEVMEM 0x2000000      /* Receive devmem skbs as cmsg */
 #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
 #define MSG_SPLICE_PAGES 0x8000000     /* Splice the pages from the iterator in sendmsg() */
 #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 2d4e0a2c5620..e7e2e89d3663 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -108,6 +108,15 @@ page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
   ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
 }
 
+static inline unsigned long
+page_pool_iov_virtual_addr(const struct page_pool_iov *ppiov)
+{
+   struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
+
+   return owner->base_virtual +
+  ((unsigned long)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
+}
+
 static inline struct netdev_dmabuf_binding *
 page_pool_iov_binding(const struct page_pool_iov *ppiov)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index 1d6931caf0c3..01029c855c1b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -353,6 +353,7 @@ struct sk_filter;
   *@sk_txtime_unused: unused txtime flags
   *@ns_tracker: tracker for netns reference
   *@sk_bind2_node: bind node in the bhash2 table
+  *@sk_user_pages: xarray of pages the user is holding a reference on.
   */
 struct sock {
/*
@@ -545,6 +546,7 @@ struct sock {
struct rcu_head sk_rcu;
netns_tracker   ns_tracker;
struct hlist_node   sk_bind2_node;
+   struct xarray   sk_user_pages;
 };
 
 enum sk_pacing {
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8ce8a39a1e5f..25a2f5255f52 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,11 @@
 #define SO_PASSPIDFD   76
 #define SO_PEERPIDFD   77
 
+#define SO_DEVMEM_LINEAR   98
+#define SCM_DEVMEM_LINEAR  SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF   99
+#define SCM_DEVMEM_DMABUF  SO_DEVMEM_DMABUF
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
index 059b1a9147f4..ad92e37699da 100644
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -20,6 +20,16 @@ struct iovec
__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
 };
 
+struct dmabuf_cmsg {
+   __u64 frag_offset;  /* offset into the dmabuf where the frag starts.
+*/
+   __u32 frag_size;/* size of the frag. */
+   __u32 frag_token;   /* token representing this frag for
+