Hi all,
This is v2 of the series, which introduces the IBNBD/IBTRS modules.
This cover letter is split into three parts:
1. Introduction, which almost repeats everything from previous cover
letters.
2. Changelog.
3. Performance measurements on linux-4.17.0-rc2 with two different
Mellanox cards (ConnectX-2 and ConnectX-3) and CPUs from Intel and AMD.
Introduction
------------
IBTRS (InfiniBand Transport) is a reliable high-speed transport library
which allows for establishing a connection between client and server
machines via RDMA. It is optimized to transfer (read/write) IO blocks
in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.
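For illustration, here is a hypothetical sketch of the client-side entry
point implied by these semantics; the real names and signatures in the
ibtrs-clt code may differ:

    #include <stddef.h>

    struct ibtrs_session;  /* opaque connection handle (illustrative) */
    struct ibtrs_sg { void *addr; size_t len; };  /* stand-in for struct scatterlist */

    enum ibtrs_dir { IBTRS_IO_READ, IBTRS_IO_WRITE };

    /*
     * WRITE: transfer data from the sg list to buffers on the remote side.
     * READ:  ask the remote side to fill the given sg list.
     * Returns 0 on success; completion is signalled asynchronously.
     */
    int ibtrs_clt_request(struct ibtrs_session *sess, enum ibtrs_dir dir,
                          struct ibtrs_sg *sg, unsigned int sg_cnt);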
IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality: in IBTRS terminology, an IBTRS path is a set of RDMA
CMs, and a particular path is selected according to the load-balancing policy.
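To make the policy idea concrete, below is a minimal userspace sketch of
one possible load-balancing policy (round-robin over healthy paths); all
names are invented and this is not the actual IBTRS code:

    #include <stdio.h>
    #include <stdbool.h>

    struct path { const char *name; bool connected; };

    static struct path paths[] = {
        { "path0", true  },
        { "path1", true  },
        { "path2", false },  /* a failed path is skipped (fail-over) */
    };

    static unsigned int rr;  /* round-robin cursor */

    static struct path *select_path(void)
    {
        size_t n = sizeof(paths) / sizeof(paths[0]);

        for (size_t i = 0; i < n; i++) {
            struct path *p = &paths[rr++ % n];

            if (p->connected)
                return p;  /* first healthy path in RR order */
        }
        return NULL;  /* all paths are down */
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++)
            printf("IO %d -> %s\n", i, select_path()->name);
        return 0;
    }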
IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.
Why?
- IBNBD/IBTRS was developed in order to map thin-provisioned volumes,
thus the internal protocol is simple.
- IBTRS was developed as an independent RDMA transport library, which
supports fail-over and load-balancing policies using multipath, so
it can be used for any other IO needs, not only for block devices.
- IBNBD/IBTRS is faster than NVMe over RDMA.
Old comparison results:
https://www.spinics.net/lists/linux-rdma/msg48799.html
New comparison results: see performance measurements section below.
Key features of IBTRS transport library and IBNBD block device:
o High throughput and low latency due to:
- Only two RDMA messages per IO.
- IMM InfiniBand messages on responses to reduce round trip latency.
- Simplified memory management: memory allocation happens once on
server side when IBTRS session is established.
o IO fail-over and load-balancing by using multipath. According to
our test loads an additional path brings ~20% more bandwidth.
o Simple configuration of IBNBD (see the mapping sketch below):
- Server side is completely passive: volumes do not need to be
explicitly exported.
- Only the IB port GID and the device path are needed on the client
side to map a block device.
- A device is remapped automatically, e.g. after a storage reboot.
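As a usage illustration, here is a hypothetical userspace sketch of
mapping a remote device through the IBNBD client sysfs interface; the
sysfs path and the key=value names are assumptions for illustration,
not the verified ABI of this patch set:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* hypothetical control file exposed by the ibnbd-client module */
        const char *ctl = "/sys/devices/virtual/ibnbd-client/ctl/map_device";
        /* session name, server IB port GID and remote device path */
        const char *cmd = "sessname=A path=gid:fe80::0002:c903:0010:ca59 "
                          "device_path=sda1";
        int fd = open(ctl, O_WRONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, cmd, strlen(cmd)) < 0)
            perror("write");
        close(fd);
        return 0;
    }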
Commits for kernel can be found here:
https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2
The out-of-tree modules are here:
https://github.com/profitbricks/ibnbd/
Vault 2017 presentation:
http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf
Changelog
---------
v2:
o IBNBD:
- No legacy request IO mode, only MQ is left.
o IBTRS:
- No FMR registration, only FR is left.
- By default memory is always registered for the sake of security,
i.e. by default no PD is created with IB_PD_UNSAFE_GLOBAL_RKEY.
- Server side (target) always does memory registration and exchanges
MR DMA addresses with the client for direct writes from the client side.
- Client side (initiator) has a `noreg_cnt` module option, which
specifies the sg entry count starting from which read IO should be
registered. By default 0 is set, i.e. memory is always registered
for read IOs. (The IBTRS protocol does not require registration for
writes, which always go directly to server memory.)
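A rough sketch of the intended semantics (the helper name is made up):

    #include <stdbool.h>

    /* Register a read IO once the sg entry count reaches the
     * `noreg_cnt` threshold; the default of 0 registers everything. */
    static bool need_register_read(unsigned int sg_cnt,
                                   unsigned int noreg_cnt)
    {
        return sg_cnt >= noreg_cnt;
    }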
- Proper DMA sync with ib_dma_sync_single_for_(cpu|device) calls.
- Do signalled IB_WR_LOCAL_INV.
- Avoid open-coding of string conversion to IPv4/6 sockaddr,
inet_pton_with_scope() is used instead.
- Introduced block device namespaces configuration on server side
(target) to avoid a security gap in untrusted environments, where a
client could map a block device which does not belong to it.
When device namespaces are enabled on the server side, the server
opens the device using the client's session name in the device path,
where the session name is a random token, e.g. a GUID. If the server
is configured to find device namespaces in the folder /run/ibnbd-guid/,
then a request to map device 'sda1' from a client with session 'A' (or
any token) will be resolved to the path /run/ibnbd-guid/A/sda1.
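For clarity, a minimal sketch of the path resolution described above
(the folder comes from the example; the helper is made up):

    #include <stdio.h>

    /* e.g. session "A" + device "sda1" -> /run/ibnbd-guid/A/sda1 */
    static int resolve_dev_path(char *buf, size_t len,
                                const char *sess, const char *dev)
    {
        int n = snprintf(buf, len, "/run/ibnbd-guid/%s/%s", sess, dev);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
    }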
- README is extended with a description of the IBTRS and IBNBD
protocols, e.g. how the IB IMM field is used to acknowledge IO
requests or heartbeats.
- IBTRS/IBNBD client and server modules are registered as devices in
the kernel in order to have all sysfs configuration entries under