Add documentation outlining the usage and details of devmem TCP.
Signed-off-by: Mina Almasry
Reviewed-by: Bagas Sanjaya
---
v16:
- Add documentation on unbinding the NIC from dmabuf (Donald).
- Add note that any dmabuf should work (Donald).
v9:
https://lore.kernel.org/netdev/[email protected]/
- Bagas doc suggestions.
v8:
- Applied docs suggestions (Randy). Thanks!
v7:
- Applied docs suggestions (Jakub).
v2:
- Missing spdx (simon)
- add to index.rst (simon)
---
Documentation/networking/devmem.rst | 269
Documentation/networking/index.rst | 1 +
2 files changed, 270 insertions(+)
create mode 100644 Documentation/networking/devmem.rst
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
new file mode 100644
index 0..417fc977844ee
--- /dev/null
+++ b/Documentation/networking/devmem.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Device Memory TCP
+=================
+
+
+Intro
+=====
+
+Device memory TCP (devmem TCP) enables receiving data directly into device
+memory (dmabuf). The feature is currently implemented for TCP sockets.
+
+
+Opportunity
+-----------
+
+A large number of data transfers have device memory as the source and/or
+destination. Accelerators have drastically increased the prevalence of such
+transfers. Some examples include:
+
+- Distributed training, where ML accelerators, such as GPUs on different hosts,
+  exchange data.
+
+- Distributed raw block storage applications transfer large amounts of data with
+  remote SSDs. Much of this data does not require host processing.
+
+Typically the Device-to-Device data transfers in the network are implemented as
+the following low-level operations: Device-to-Host copy, Host-to-Host network
+transfer, and Host-to-Device copy.
+
+The flow involving host copies is suboptimal, especially for bulk data transfers,
+and can put significant strain on system resources such as host memory
+bandwidth and PCIe bandwidth.
+
+Devmem TCP optimizes this use case by implementing socket APIs that enable
+the user to receive incoming network packets directly into device memory.
+
+Packet payloads go directly from the NIC to device memory.
+
+Packet headers go to host memory and are processed by the TCP/IP stack
+normally. The NIC must support header split to achieve this.
+
+Advantages:
+
+- Alleviate host memory bandwidth pressure, compared to existing
+  network-transfer + device-copy semantics.
+
+- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
+  level of the PCIe tree, compared to the traditional path which sends data
+  through the root complex.
+
+
+More Info
+---------
+
+  slides, video
+    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
+
+  patchset
+    [RFC PATCH v6 00/12] Device Memory TCP
+    https://lore.kernel.org/netdev/[email protected]/
+
+
+Interface
+=========
+
+
+Example
+-------
+
+tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
+the RX path of this API.
+
+
+NIC Setup
+---------
+
+Header split, flow steering, & RSS are required features for devmem TCP.
+
+Header split is used to split incoming packets into a header buffer in host
+memory, and a payload buffer in device memory.
+
+Flow steering & RSS are used to ensure that only flows targeting devmem land on
+an RX queue bound to devmem.
+
+Enable header split & flow steering::
+
+ # enable header split
+ ethtool -G eth1 tcp-data-split on
+
+
+ # enable flow steering
+ ethtool -K eth1 ntuple on
+
+Configure RSS to steer all traffic away from the target RX queue (queue 15 in
+this example)::
+
+ ethtool --set-rxfh-indir eth1 equal 15
+
+
+The user must bind a dmabuf to any number of RX queues on a given NIC using
+the netlink API::
+
+ /* Bind dmabuf to NIC RX queue 15 */
+ struct netdev_queue *queues;
+ queues = malloc(sizeof(*queues) * 1);
+
+ queues[0]._present.type = 1;
+ queues[0]._present.idx = 1;
+ queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
+ queues[0].idx = 15;
+
+ *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_rx_req_alloc();
+ netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
+ netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
+ __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
+
+ rsp = netdev_bind_rx(*ys, req);
+
+ dmabuf_id = rsp->dmabuf_id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is n