Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Doug,

thanks for the feedback. You read the cover letter correctly: our transport library implements multipath (load balancing and failover) on top of the RDMA API. Its name "IBTRS" is slightly misleading in that regard: it can sit on top of RoCE as well. The library allows for "bundling" multiple RDMA "paths" (source addr - destination addr pairs) into one "session". So our session consists of one or more paths, and each path under the hood consists of as many QPs (each connecting source with destination) as there are CPUs on the client system. The user load (in our case IBNBD is a block device and generates block requests) is load-balanced on a per-CPU basis. I understand this is something very different from what smc-r is doing. Am I right? Do you know what stage MP-RDMA development is currently at?

Best,
Danil Kipnis

P.S. Sorry for the duplicate, if any; the first mail was returned because of HTML.

On Thu, Feb 8, 2018 at 7:10 PM Bart Van Assche wrote:
>
> On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote:
> > thanks for the link to the article. To the best of my understanding,
> > the guys suggest authenticating the devices first and only then
> > authenticating the users who use the devices in order to get access to a
> > corporate service. They also mention in the presentation the current
> > trend of moving corporate services into the cloud. But I think this is
> > not about the devices that cloud is built of. Isn't a cloud
> > first built out of devices connected via IB and then users (and their
> > devices) are provided access to the services of that cloud as a whole?
> > If a malicious user already plugged his device into an IB switch of a
> > cloud internal infrastructure, isn't it game over anyway? Can't he
> > just take the hard drives instead of mapping them?
>
> Hello Danil,
>
> It seems like we each have been focussing on different aspects of the article.
> The reason I referred to that article is because I read the following in > that article: "Unlike the conventional perimeter security model, BeyondCorp > doesn’t gate access to services and tools based on a user’s physical location > or the originating network [ ... ] The zero trust architecture spells trouble > for traditional attacks that rely on penetrating a tough perimeter to waltz > freely within an open internal network." Suppose e.g. that an organization > decides to use RoCE or iWARP for connectivity between block storage initiator > systems and block storage target systems and that it has a single company- > wide Ethernet network. If the target system does not restrict access based > on initiator IP address then any penetrator would be able to access all the > block devices exported by the target after a SoftRoCE or SoftiWARP initiator > driver has been loaded. If the target system however restricts access based > on the initiator IP address then that would make it harder for a penetrator > to access the exported block storage devices. Instead of just penetrating the > network access, IP address spoofing would have to be used or access would > have to be obtained to a system that has been granted access to the target > system. > > Thanks, > > Bart. > > -- Danil Kipnis Linux Kernel Developer
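For illustration, the session/path/QP bundling Danil describes (one session, several paths, one QP per client CPU, load balanced per CPU with failover) could be modeled roughly as below. This is a hedged userspace sketch, not actual IBTRS code; all names (`struct session`, `select_qp`, the constants) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS   4   /* illustrative; a real client queries the CPU count */
#define MAX_PATHS 2

struct qp { int id; };

struct path {
    int alive;                 /* failover: dead paths are skipped */
    struct qp qps[NR_CPUS];    /* one QP per client CPU */
};

struct session {
    struct path paths[MAX_PATHS];
    unsigned int rr;           /* round-robin cursor for load balancing */
};

/* Pick the QP for an I/O submitted on 'cpu': round-robin over the alive
 * paths of the session, then index by the submitting CPU so each CPU
 * keeps using its own QP. */
static struct qp *select_qp(struct session *s, int cpu)
{
    for (int tried = 0; tried < MAX_PATHS; tried++) {
        struct path *p = &s->paths[s->rr++ % MAX_PATHS];
        if (p->alive)
            return &p->qps[cpu];
    }
    return NULL; /* all paths failed */
}
```

The per-CPU indexing keeps submissions lock-free on the fast path; only the path choice rotates.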
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote: > thanks for the link to the article. To the best of my understanding, > the guys suggest to authenticate the devices first and only then > authenticate the users who use the devices in order to get access to a > corporate service. They also mention in the presentation the current > trend of moving corporate services into the cloud. But I think this is > not about the devices from which that cloud is build of. Isn't a cloud > first build out of devices connected via IB and then users (and their > devices) are provided access to the services of that cloud as a whole? > If a malicious user already plugged his device into an IB switch of a > cloud internal infrastructure, isn't it game over anyway? Can't he > just take the hard drives instead of mapping them? Hello Danil, It seems like we each have been focussing on different aspects of the article. The reason I referred to that article is because I read the following in that article: "Unlike the conventional perimeter security model, BeyondCorp doesn’t gate access to services and tools based on a user’s physical location or the originating network [ ... ] The zero trust architecture spells trouble for traditional attacks that rely on penetrating a tough perimeter to waltz freely within an open internal network." Suppose e.g. that an organization decides to use RoCE or iWARP for connectivity between block storage initiator systems and block storage target systems and that it has a single company- wide Ethernet network. If the target system does not restrict access based on initiator IP address then any penetrator would be able to access all the block devices exported by the target after a SoftRoCE or SoftiWARP initiator driver has been loaded. If the target system however restricts access based on the initiator IP address then that would make it harder for a penetrator to access the exported block storage devices. 
Instead of just penetrating the network access, IP address spoofing would have to be used or access would have to be obtained to a system that has been granted access to the target system. Thanks, Bart.
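As a toy illustration of the restriction Bart describes (a target that only exports block devices to a configured set of initiator addresses), a minimal allow-list check might look like the following. The addresses and the function name are invented for the example; a real target would populate the list from its configuration and consult it in its connection-accept path.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical allow-list of initiator IP addresses. */
static const char *allowed_initiators[] = {
    "10.0.0.5",
    "10.0.0.6",
};

/* Return 1 if 'addr' may access the exported block devices, 0 otherwise. */
static int initiator_allowed(const char *addr)
{
    size_t n = sizeof(allowed_initiators) / sizeof(allowed_initiators[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(allowed_initiators[i], addr) == 0)
            return 1;
    return 0;
}
```

As the message notes, such a check raises the bar (an attacker must spoof an address or compromise an allowed host) without being a complete defense.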
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Wed, Feb 7, 2018 at 6:32 PM, Bart Van Assche wrote: > On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote: >> So the question is: are there real life setups where >> some of the local IB network members can be untrusted? > > Hello Roman, > > You may want to read more about the latest evolutions with regard to network > security. An article that I can recommend is the following: "Google reveals > own security regime policy trusts no network, anywhere, ever" > (https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/). > > If data-centers would start deploying RDMA among their entire data centers > (maybe they are already doing this) then I think they will want to restrict > access to block devices to only those initiator systems that need it. > > Thanks, > > Bart. > > Hi Bart, thanks for the link to the article. To the best of my understanding, the guys suggest authenticating the devices first and only then authenticating the users who use the devices in order to get access to a corporate service. They also mention in the presentation the current trend of moving corporate services into the cloud. But I think this is not about the devices that cloud is built of. Isn't a cloud first built out of devices connected via IB and then users (and their devices) are provided access to the services of that cloud as a whole? If a malicious user already plugged his device into an IB switch of a cloud internal infrastructure, isn't it game over anyway? Can't he just take the hard drives instead of mapping them? Thanks, Danil.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote: > So the question is: are there real life setups where > some of the local IB network members can be untrusted? Hello Roman, You may want to read more about the latest evolutions with regard to network security. An article that I can recommend is the following: "Google reveals own security regime policy trusts no network, anywhere, ever" (https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/). If data-centers would start deploying RDMA among their entire data centers (maybe they are already doing this) then I think they will want to restrict access to block devices to only those initiator systems that need it. Thanks, Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Wed, Feb 7, 2018 at 5:35 PM, Christopher Lameter wrote: > On Mon, 5 Feb 2018, Bart Van Assche wrote: > >> That approach may work well for your employer but sorry I don't think this is >> sufficient for an upstream driver. I think that most users who configure a >> network storage target expect full control over which storage devices are >> exported >> and also over which clients do have and do not have access. > > Well is that actually true for IPoIB? It seems that I can arbitrarily > attach to any partition I want without access control. In many ways some > of the RDMA layers and modules are loose with security since performance > is what matters mostly and deployments occur in separate production > environments. > > We have had security issues (that not fully resolved yet) with the RDMA > RPC API for years.. So maybe lets relax on the security requirements a > bit? > Frankly speaking I do not understand the "security" about this kind of block devices and RDMA in particular. I can admit that personally I do not see the whole picture, so can someone provide the real usecase/scenario? What we have in our datacenters is trusted environment (do others exist?). You need a volume, you create it. You need to map a volume remotely - you map it. Of course there are provisioning checks, rw/ro checks, etc. But in general any IP/key checks (is that client really a "good" guy or not?) are simply useless. So the question is: are there real life setups where some of the local IB network members can be untrusted? -- Roman
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, 5 Feb 2018, Bart Van Assche wrote: > That approach may work well for your employer but sorry I don't think this is > sufficient for an upstream driver. I think that most users who configure a > network storage target expect full control over which storage devices are > exported > and also over which clients do have and do not have access. Well, is that actually true for IPoIB? It seems that I can arbitrarily attach to any partition I want without access control. In many ways some of the RDMA layers and modules are loose with security, since performance is what matters mostly and deployments occur in separate production environments. We have had security issues (that are not fully resolved yet) with the RDMA RPC API for years. So maybe let's relax the security requirements a bit?
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Wed, 2018-02-07 at 13:57 +0100, Roman Penyaev wrote: > On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche > wrote: > > On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote: > > Something else I would like to understand better is how much of the latency > > gap between NVMeOF/SRP and IBNBD can be closed without changing the wire > > protocol. Was e.g. support for immediate data present in the NVMeOF and/or > > SRP drivers used on your test setup? > > I did not get the question. IBTRS uses empty messages with only imm_data > field set to respond on IO. This is a part of the IBTRS protocol. I do > not understand how can immediate data be present in other drivers, if > those do not use it in their protocols. I am lost here. With "immediate data" I was referring to including the entire write buffer in the write PDU itself. See e.g. the enable_imm_data kernel module parameter of the ib_srp-backport driver. See also the use of SRP_DATA_DESC_IMM in the SCST ib_srpt target driver. Neither the upstream SRP initiator nor the upstream SRP target support immediate data today. However, sending that code upstream is on my to-do list. For the upstream NVMeOF initiator and target drivers, see also the call of nvme_rdma_map_sg_inline() in nvme_rdma_map_data(). > > Are you aware that the NVMeOF target driver calls page_alloc() from the hot > > path but that there are plans to > > avoid these calls in the hot path by using a caching mechanism similar to > > the SGV cache in SCST? Are you aware that a significant latency reduction > > can be achieved by changing the SCST SGV cache from a global into a per-CPU > > cache? > > No, I am not aware. That is nice, that there is a lot of room for performance > tweaks. I will definitely retest on fresh kernel once everything is done on > nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys). Recently the functions sgl_alloc() and sgl_free() were introduced in the upstream kernel (these will be included in kernel v4.16). 
The NVMe target driver, LIO and several other drivers have been modified to use these functions instead of their own copy of that function. The next step is to replace these function calls by calls to functions that perform cached allocations. > > Regarding the SRP measurements: have you tried to set the > > never_register kernel module parameter to true? I'm asking this because I > > think that mode is most similar to how the IBNBD initiator driver works. > > yes, according to my notes from that link (frankly, I do not remember, > but that is what I wrote 1 year ago): > > * Where suffixes mean: > > _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded > with 'register_always=N' param > > That what you are asking, right? Not really. With register_always=Y memory registration is always used by the SRP initiator, even if the data can be coalesced into a single sg entry. With register_always=N memory registration is only performed if multiple sg entries are needed to describe the data. And with never_register=Y memory registration is not used even if multiple sg entries are needed to describe the data buffer. Thanks, Bart.
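Bart's three registration modes reduce to a small decision table. The sketch below only restates the semantics he describes; the parameter names follow ib_srp, but this is illustrative logic, not driver code.

```c
#include <assert.h>

enum reg_mode {
    REGISTER_ALWAYS,    /* register_always=Y: always register memory    */
    REGISTER_MULTI_SG,  /* register_always=N: register only for >1 sge  */
    NEVER_REGISTER      /* never_register=Y:  never register memory     */
};

/* Would a request with 'nr_sg_entries' scatter-gather entries trigger
 * memory registration under the given mode? */
static int needs_memory_registration(enum reg_mode mode, int nr_sg_entries)
{
    switch (mode) {
    case REGISTER_ALWAYS:   return 1;
    case REGISTER_MULTI_SG: return nr_sg_entries > 1;
    case NEVER_REGISTER:    return 0;
    }
    return 1; /* defensive default */
}
```

This makes the point of the question explicit: never_register=Y is the only mode that, like the IBNBD initiator, skips registration even for multi-entry buffers.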
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Sagi and all, On Mon, Feb 5, 2018 at 1:30 PM, Sagi Grimberg wrote: > Hi Roman and the team (again), replying to my own email :) > > I forgot to mention that first of all thank you for upstreaming > your work! I fully support your goal to have your production driver > upstream to minimize your maintenance efforts. I hope that my > feedback didn't come across with a different impression, that was > certainly not my intent. Well, I've just recovered from two heart attacks, which I got while reading your replies, but now I am fine, thanks :) > It would be great if you can address and/or reply to my feedback > (as well as others) and re-spin it again. Jokes aside, we would like to thank you all, guys, for the valuable feedback. I got a lot of useful remarks from you Sagi and you Bart. We will try to cover them in the next version and will provide up-to-date comparison results. -- Roman
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche wrote: > On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote: >> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg wrote: >> > [ ... ] >> > - srp/scst comparison is really not fair having it in legacy request >> > mode. Can you please repeat it and report a bug to either linux-rdma >> > or to the scst mailing list? >> >> Yep, I can retest with mq. >> >> > - Your latency measurements are surprisingly high for a null target >> > device (even for low end nvme device actually) regardless of the >> > transport implementation. >> >> Hm, network configuration? These are results on machines dedicated >> to our team for testing in one of our datacenters. Nothing special >> in configuration. > Hello Bart, > I agree that the latency numbers are way too high for a null target device. > Last time I measured latency for the SRP protocol against an SCST target > + null block driver at the target side and ConnectX-3 adapters I measured a > latency of about 14 microseconds. That's almost 100 times less than the > measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html. Here is the following configuration of the setup: Initiator and target HW configuration: AMD Opteron 6386 SE, 64CPU, 128Gb InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] Also, I remember that between initiator and target there were two IB switches. Unfortunately, I can't repeat the same configuration, but will retest as soon as we get new HW. > Something else I would like to understand better is how much of the latency > gap between NVMeOF/SRP and IBNBD can be closed without changing the wire > protocol. Was e.g. support for immediate data present in the NVMeOF and/or > SRP drivers used on your test setup? I did not get the question. IBTRS uses empty messages with only imm_data field set to respond on IO. This is a part of the IBTRS protocol. 
I do not understand how can immediate data be present in other drivers, if those do not use it in their protocols. I am lost here. > Are you aware that the NVMeOF target driver calls page_alloc() from the hot > path but that there are plans to > avoid these calls in the hot path by using a caching mechanism similar to > the SGV cache in SCST? Are you aware that a significant latency reduction > can be achieved by changing the SCST SGV cache from a global into a per-CPU > cache? No, I am not aware. That is nice, that there is a lot of room for performance tweaks. I will definitely retest on fresh kernel once everything is done on nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys). Maybe there are some other parameters which can be also tweaked? > Regarding the SRP measurements: have you tried to set the > never_register kernel module parameter to true? I'm asking this because I > think that mode is most similar to how the IBNBD initiator driver works. yes, according to my notes from that link (frankly, I do not remember, but that is what I wrote 1 year ago): * Where suffixes mean: _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded with 'register_always=N' param That what you are asking, right? -- Roman
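As an aside, the "empty message with only imm_data set" completion scheme Roman describes can be sketched as packing the whole completion into the 32-bit immediate of an otherwise payload-free RDMA message. The field layout below is invented for illustration and is not the actual IBTRS wire format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: high 16 bits carry the request id, low 16 bits
 * the status, so the responder needs no data buffer at all. */
static uint32_t imm_encode(uint16_t msg_id, uint16_t status)
{
    return ((uint32_t)msg_id << 16) | status;
}

static void imm_decode(uint32_t imm, uint16_t *msg_id, uint16_t *status)
{
    *msg_id = (uint16_t)(imm >> 16);
    *status = (uint16_t)(imm & 0xffff);
}
```

The contrast with the "immediate data" Bart asks about is worth noting: his question concerns carrying the *write payload* inline in the request PDU, whereas this immediate is a 32-bit field on the *completion* path.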
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote: > On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg wrote: > > [ ... ] > > - srp/scst comparison is really not fair having it in legacy request > > mode. Can you please repeat it and report a bug to either linux-rdma > > or to the scst mailing list? > > Yep, I can retest with mq. > > > - Your latency measurements are surprisingly high for a null target > > device (even for low end nvme device actually) regardless of the > > transport implementation. > > Hm, network configuration? These are results on machines dedicated > to our team for testing in one of our datacenters. Nothing special > in configuration. Hello Roman, I agree that the latency numbers are way too high for a null target device. Last time I measured latency for the SRP protocol against an SCST target + null block driver at the target side and ConnectX-3 adapters I measured a latency of about 14 microseconds. That's almost 100 times less than the measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html. Something else I would like to understand better is how much of the latency gap between NVMeOF/SRP and IBNBD can be closed without changing the wire protocol. Was e.g. support for immediate data present in the NVMeOF and/or SRP drivers used on your test setup? Are you aware that the NVMeOF target driver calls page_alloc() from the hot path but that there are plans to avoid these calls in the hot path by using a caching mechanism similar to the SGV cache in SCST? Are you aware that a significant latency reduction can be achieved by changing the SCST SGV cache from a global into a per-CPU cache? Regarding the SRP measurements: have you tried to set the never_register kernel module parameter to true? I'm asking this because I think that mode is most similar to how the IBNBD initiator driver works. Thanks, Bart.
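The per-CPU caching idea Bart mentions (turning the global SGV cache into a per-CPU one) can be modeled in userspace C as below. This is only a model of why the hot path avoids both the allocator and cross-CPU contention; a kernel version would use per-CPU variables and handle preemption, and all names here are invented.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS     4
#define CACHE_DEPTH 8

/* Each CPU recycles buffers from its own small stack of free slots. */
struct percpu_cache {
    void *slot[NR_CPUS][CACHE_DEPTH];
    int   top[NR_CPUS];
};

static void *cache_get(struct percpu_cache *c, int cpu, size_t size)
{
    if (c->top[cpu] > 0)
        return c->slot[cpu][--c->top[cpu]]; /* hot path: no allocation */
    return malloc(size);                    /* cold path: fall back */
}

static void cache_put(struct percpu_cache *c, int cpu, void *p)
{
    if (c->top[cpu] < CACHE_DEPTH)
        c->slot[cpu][c->top[cpu]++] = p;    /* recycle locally */
    else
        free(p);                            /* cache full: give back */
}
```

Because each CPU touches only its own slots, no lock is needed on the fast path, which is the latency win over a single shared cache.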
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Tue, 2018-02-06 at 10:44 +0100, Danil Kipnis wrote: > the configuration (which devices can be accessed by a particular > client) can happen also after the kernel target module is loaded. The > directory is a module parameter and is fixed. It > contains for example "/ibnbd_devices/". But a particular client X > would be able to only access the devices located in the subdirectory > "/ibnbd_devices/client_x/". (The session name here is client_x.) One can > add or remove the devices from that directory (those are just symlinks > to /dev/xxx) at any time - before or after the server module is > loaded. But you are right, we need something additional in order to be > able to specify which devices a client can access writable and which > read-only. Maybe additional subdirectories "wr" and "ro" for each client: > those under /ibnbd_devices/client_x/ro/ can only be read by client_x > and those in /ibnbd_devices/client_x/wr/ can also be written to? Please use a standard kernel filesystem (sysfs or configfs) instead of reinventing it. Thanks, Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Sagi, On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg wrote: > Hi Roman and the team, > > On 02/02/2018 04:08 PM, Roman Pen wrote: >> >> This series introduces IBNBD/IBTRS modules. >> >> IBTRS (InfiniBand Transport) is a reliable high speed transport library >> which allows for establishing connection between client and server >> machines via RDMA. > > > So its not strictly infiniband correct? This is RDMA. Original IB prefix is a bit confusing, that's true. > It is optimized to transfer (read/write) IO blocks >> >> in the sense that it follows the BIO semantics of providing the >> possibility to either write data from a scatter-gather list to the >> remote side or to request ("read") data transfer from the remote side >> into a given set of buffers. >> >> IBTRS is multipath capable and provides I/O fail-over and load-balancing >> functionality. > > > Couple of questions on your multipath implementation? > 1. What was your main objective over dm-multipath? No objections, mpath is a part of the transport ibtrs library. > 2. What was the consideration of this implementation over > creating a stand-alone bio based device node to reinject the > bio to the original block device? ibnbd and ibtrs are separate, on fail-over or load-balancing we work with IO requests inside a library. >> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules >> (client and server) that allow for remote access of a block device on >> the server over IBTRS protocol. After being mapped, the remote block >> devices can be accessed on the client side as local block devices. >> Internally IBNBD uses IBTRS as an RDMA transport library. >> >> Why? >> >> - IBNBD/IBTRS is developed in order to map thin provisioned volumes, >> thus internal protocol is simple and consists of several request >> types only without awareness of underlaying hardware devices. > > > Can you explain how the protocol is developed for thin-p? What are the > essence of how its suited for it? 
Here I wanted to emphasize, that we do not support any HW commands, like nvme does, thus internal protocol consists of several commands. So answering on your question "how the protocol is developed for thin-p" I would put it another way around: "protocol does nothing to support real device, because all we need is to map thin-p volumes". It is just simpler. >> - IBTRS was developed as an independent RDMA transport library, which >> supports fail-over and load-balancing policies using multipath, thus >> it can be used for any other IO needs rather than only for block >> device. > > > What do you mean by "any other IO"? I mean other IO producers, not only ibnbd, since this is just a transport library. > >> - IBNBD/IBTRS is faster than NVME over RDMA. Old comparison results: >> https://www.spinics.net/lists/linux-rdma/msg48799.html >> (I retested on latest 4.14 kernel - there is no any significant >> difference, thus I post the old link). > > > That is interesting to learn. > > Reading your reference brings a couple of questions though, > - Its unclear to me how ibnbd performs reads without performing memory > registration. Is it using the global dma rkey? Yes, global rkey. WRITE: writes from client READ: writes from server > - Its unclear to me how there is a difference in noreg in writes, > because for small writes nvme-rdma never register memory (it uses > inline data). No support for that. > - Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, that > seems considerably low against other reports. Can you try and explain > what was the bottleneck? This can be a potential bug and I (and the > rest of the community is interesting in knowing more details). Sure, I can try. BTW, what are other reports and numbers? > - srp/scst comparison is really not fair having it in legacy request > mode. Can you please repeat it and report a bug to either linux-rdma > or to the scst mailing list? Yep, I can retest with mq. 
> - Your latency measurements are surprisingly high for a null target > device (even for low end nvme device actually) regardless of the > transport implementation. Hm, network configuration? These are results on machines dedicated to our team for testing in one of our datacenters. Nothing special in configuration. > For example: > - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is > fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond > and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1 > latency I got ~14 us. So something does not add up here. If this is > not some configuration issue, then we have serious bugs to handle.. > > - QD=16 the read latencies are > 10ms for null devices?! I'm having > troubles understanding how you were able to get such high latencies > (> 100 ms for QD>=100) What does QD stand for? Queue depth? This is not a queue depth, this is how many fio jo
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, Feb 5, 2018 at 6:20 PM, Bart Van Assche wrote: > On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote: >> Everything (fio jobs, setup, etc) is given in the same link: >> >> https://www.spinics.net/lists/linux-rdma/msg48799.html >> >> at the bottom you will find links on google docs with many pages >> and archived fio jobs and scripts. (I do not remember exactly, >> one year passed, but there should be everything). >> >> Regarding smaller iodepth_batch_submit - that decreases performance. >> Once I played with that, even introduced new iodepth_batch_complete_max >> option for fio, but then I decided to stop and simply chose this >> configuration, which provides me fastest results. > > Hello Roman, > > That's weird. For which protocols did reducing iodepth_batch_submit lead > to lower performance: all the tested protocols or only some of them? Hi Bart, Seems that does not depend on protocol (when I tested it was true for nvme and ibnbd). That depends on a load. On high load (1 or few fio jobs are dedicated to each cpu, and we have 64 cpus) it turns out to be faster to wait completions for all queue for that particular block dev, instead of switching from kernel to userspace for each completed IO. But I can assure you that performance difference is very minor, it exists, but it does not change the whole picture of what you see on this google sheet. So what I tried to achieve is to squeeze everything I could, nothing more. -- Roman
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, Feb 5, 2018 at 7:38 PM, Bart Van Assche wrote: > On 02/05/18 08:40, Danil Kipnis wrote: >> >> It just occurred to me, that we could easily extend the interface in >> such a way that each client (i.e. each session) would have on server >> side her own directory with the devices it can access. I.e. instead of >> just "dev_search_path" per server, any client would be able to only >> access devices under <dev_search_path>/session_name. (session name >> must already be generated by each client in a unique way). This way >> one could have an explicit control over which devices can be accessed >> by which clients. Do you think that would do it? > > > Hello Danil, > > That sounds interesting to me. However, I think that approach requires to > configure client access completely before the kernel target side module is > loaded. It does not allow to configure permissions dynamically after the > kernel target module has been loaded. Additionally, I don't see how to > support attributes per (initiator, block device) pair with that approach. > LIO e.g. supports the > /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You > may want to implement similar functionality if you want to convince more > users to use IBNBD. > > Thanks, > > Bart. Hello Bart, the configuration (which devices can be accessed by a particular client) can happen also after the kernel target module is loaded. The directory is a module parameter and is fixed. It contains for example "/ibnbd_devices/". But a particular client X would be able to only access the devices located in the subdirectory "/ibnbd_devices/client_x/". (The session name here is client_x.) One can add or remove the devices from that directory (those are just symlinks to /dev/xxx) at any time - before or after the server module is loaded. But you are right, we need something additional in order to be able to specify which devices a client can access writable and which read-only.
Maybe additional subdirectories "wr" and "ro" for each client: those under /ibnbd_devices/client_x/ro/ can only be read by client_x, and those in /ibnbd_devices/client_x/wr/ can also be written to? Thanks, Danil.
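Under this proposal the export path itself encodes both the session and the access mode. A hypothetical server-side helper resolving that path might look like the following; the directory names match the proposal, but the helper itself is invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/ibnbd_devices/<session>/{ro,wr}/<dev>" for a given client
 * session and device. Returns the snprintf() length. */
static int build_export_path(char *buf, size_t len, const char *session,
                             const char *dev, int writable)
{
    return snprintf(buf, len, "/ibnbd_devices/%s/%s/%s",
                    session, writable ? "wr" : "ro", dev);
}
```

A lookup that only searches the "ro" subtree for read-only mappings would then enforce the permission purely through the filesystem layout, which is exactly the reinvention Bart's reply cautions against in favor of sysfs/configfs.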
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On 02/05/18 08:40, Danil Kipnis wrote:
> It just occurred to me, that we could easily extend the interface in
> such a way that each client (i.e. each session) would have on server
> side her own directory with the devices it can access. I.e. instead of
> just "dev_search_path" per server, any client would be able to only
> access devices under <dev_search_path>/session_name. (session name
> must already be generated by each client in a unique way). This way
> one could have an explicit control over which devices can be accessed
> by which clients. Do you think that would do it?

Hello Danil,

That sounds interesting to me. However, I think that approach requires to configure client access completely before the kernel target side module is loaded. It does not allow to configure permissions dynamically after the kernel target module has been loaded. Additionally, I don't see how to support attributes per (initiator, block device) pair with that approach. LIO e.g. supports the /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You may want to implement similar functionality if you want to convince more users to use IBNBD.

Thanks,

Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote: > Everything (fio jobs, setup, etc) is given in the same link: > > https://www.spinics.net/lists/linux-rdma/msg48799.html > > at the bottom you will find links on google docs with many pages > and archived fio jobs and scripts. (I do not remember exactly, > one year passed, but there should be everything). > > Regarding smaller iodepth_batch_submit - that decreases performance. > Once I played with that, even introduced new iodepth_batch_complete_max > option for fio, but then I decided to stop and simply chose this > configuration, which provides me fastest results. Hello Roman, That's weird. For which protocols did reducing iodepth_batch_submit lead to lower performance: all the tested protocols or only some of them? Thanks, Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Bart, On Mon, Feb 5, 2018 at 5:58 PM, Bart Van Assche wrote: > On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote: >> - Your latency measurements are surprisingly high for a null target >>device (even for low end nvme device actually) regardless of the >>transport implementation. >> >> For example: >> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is >>fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond >>and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1 >>latency I got ~14 us. So something does not add up here. If this is >>not some configuration issue, then we have serious bugs to handle.. >> >> - QD=16 the read latencies are > 10ms for null devices?! I'm having >>troubles understanding how you were able to get such high latencies >>(> 100 ms for QD>=100) >> >> Can you share more information about your setup? It would really help >> us understand more. > > I would also appreciate it if more information could be provided about the > measurement results. In addition to answering Sagi's questions, would it > be possible to share the fio job that was used for measuring latency? In > https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf > I found the following: > > iodepth=128 > iodepth_batch_submit=128 > > If you want to keep the pipeline full I think that you need to set the > iodepth_batch_submit parameter to a value that is much lower than iodepth. > I think that setting iodepth_batch_submit equal to iodepth will yield > suboptimal IOPS results. Jens, please correct me if I got this wrong. Sorry, Bart, I would answer here in a few words (I would like to answer in details tomorrow on Sagi's mail). Everything (fio jobs, setup, etc) is given in the same link: https://www.spinics.net/lists/linux-rdma/msg48799.html at the bottom you will find links on google docs with many pages and archived fio jobs and scripts. 
(I do not remember exactly, one year has passed, but everything should be there.) Regarding a smaller iodepth_batch_submit: that decreases performance. I once played with that, and even introduced the new iodepth_batch_complete_max option for fio, but then I decided to stop and simply chose this configuration, which gives me the fastest results. -- Roman
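For reference, a fio job along the lines of the two parameters under discussion might look like the sketch below. This is an assumption, not the actual job file from the slides: the device path, block size and runtime are made-up placeholders, and only the iodepth and iodepth_batch_submit values come from the thread. Bart's proposed alternative is shown as a comment.

```ini
; Hypothetical latency/IOPS job sketch; /dev/ibnbd0 and the sizes are
; placeholders, not taken from the original measurements.
[randread]
filename=/dev/ibnbd0
ioengine=libaio
direct=1
rw=randread
bs=4k
time_based=1
runtime=60
; Configuration as reported in the Vault slides: submit in full batches.
iodepth=128
iodepth_batch_submit=128
; Bart's suggestion would instead be something much lower, e.g.:
; iodepth_batch_submit=16
; Roman additionally mentions fio's iodepth_batch_complete_max option,
; which bounds how many completions are reaped per check.
```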
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
> - Your latency measurements are surprisingly high for a null target
>   device (even for low end nvme device actually) regardless of the
>   transport implementation.
>
> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs, right?) which is
>   fairly high. On nvme-rdma it's 1058 us, which means over 1 millisecond,
>   and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>   latency I got ~14 us. So something does not add up here. If this is
>   not some configuration issue, then we have serious bugs to handle..
>
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>   trouble understanding how you were able to get such high latencies
>   (> 100 ms for QD>=100)
>
> Can you share more information about your setup? It would really help
> us understand more.

I would also appreciate it if more information could be provided about the measurement results. In addition to answering Sagi's questions, would it be possible to share the fio job that was used for measuring latency? In https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf I found the following:

iodepth=128
iodepth_batch_submit=128

If you want to keep the pipeline full I think that you need to set the iodepth_batch_submit parameter to a value that is much lower than iodepth. I think that setting iodepth_batch_submit equal to iodepth will yield suboptimal IOPS results. Jens, please correct me if I got this wrong.

Thanks,

Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, Feb 5, 2018 at 3:17 PM, Sagi Grimberg wrote:
>>>> Hi Bart,
>>>>
>>>> Another 2 cents from me :)
>>>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>>>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>>>> o Simple configuration of IBNBD:
>>>>>>    - Server side is completely passive: volumes do not need to be
>>>>>>      explicitly exported.
>>>>>
>>>>> That sounds like a security hole? I think the ability to configure
>>>>> whether or not an initiator is allowed to log in is essential and also
>>>>> which volumes an initiator has access to.
>>>>
>>>> Our design targets a well-controlled production environment, so security
>>>> is handled in another layer.
>>>
>>> What will happen to a new adopter of the code you are contributing?
>>
>> Hi Sagi, Hi Bart,
>> thanks for your feedback. We considered the "storage cluster" setup, where
>> each ibnbd client has access to each ibnbd server. Each ibnbd server
>> manages devices under its "dev_search_path" and can provide access to them
>> to any ibnbd client in the network.
>
> I don't understand how that helps?
>
>> On top of that the ibnbd server has an additional "artificial" restriction,
>> that a device can be mapped in writable mode by only one client at a time.
>
> I think one would still need the option to disallow readable export as well.

It just occurred to me that we could easily extend the interface in such a way that each client (i.e. each session) would have its own directory on the server side with the devices it can access. I.e. instead of just one "dev_search_path" per server, any client would only be able to access the devices under /session_name. (The session name is already generated by each client in a unique way.) This way one would have explicit control over which devices can be accessed by which clients. Do you think that would do it?
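Danil's per-session directory idea can be sketched in user-space Python purely as an illustration (the real server is a kernel module, and every name here, including the directory layout and the helper function, is an assumption rather than actual IBNBD code):

```python
import os

def resolve_device(dev_search_path, session_name, device_name):
    """Resolve the device a client session may open, restricted to a
    hypothetical per-session subdirectory dev_search_path/session_name/.
    All names are illustrative, not the real IBNBD interface."""
    base = os.path.realpath(os.path.join(dev_search_path, session_name))
    candidate = os.path.realpath(os.path.join(base, device_name))
    # Reject anything that escapes the session directory,
    # e.g. a client-supplied "../other_session/vol0".
    if os.path.commonpath([base, candidate]) != base:
        raise PermissionError(
            "device %r is outside the session directory" % device_name)
    return candidate
```

With such a layout the admin would control access simply by choosing which device nodes (or symlinks) to place under each session's directory, which matches the "explicit control over which devices can be accessed by which clients" goal above.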
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, Feb 5, 2018 at 5:16 PM, Bart Van Assche wrote:
> On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
>> Hi Bart,
>>
>> Another 2 cents from me :)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>> o Simple configuration of IBNBD:
>>>>    - Server side is completely passive: volumes do not need to be
>>>>      explicitly exported.
>>>
>>> That sounds like a security hole? I think the ability to configure whether
>>> or not an initiator is allowed to log in is essential and also which
>>> volumes an initiator has access to.
>>
>> Our design targets a well-controlled production environment, so security is
>> handled in another layer. On the server side, the admin can set the
>> dev_search_path module parameter to set the parent directory; this is
>> concatenated with the path the client sends in the open message to open a
>> block device.
>
> Hello Jack,
>
> That approach may work well for your employer, but sorry, I don't think this
> is sufficient for an upstream driver. I think that most users who configure
> a network storage target expect full control over which storage devices are
> exported and also over which clients do and do not have access.
>
> Bart.

Hello Bart,

I agree that for general-purpose use it may be good to have better access control.

Thanks,
--
Jack Wang
Linux Kernel Developer
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
> Hi Bart,
>
> Another 2 cents from me :)
> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>> o Simple configuration of IBNBD:
>>>    - Server side is completely passive: volumes do not need to be
>>>      explicitly exported.
>>
>> That sounds like a security hole? I think the ability to configure whether
>> or not an initiator is allowed to log in is essential and also which
>> volumes an initiator has access to.
>
> Our design targets a well-controlled production environment, so security is
> handled in another layer. On the server side, the admin can set the
> dev_search_path module parameter to set the parent directory; this is
> concatenated with the path the client sends in the open message to open a
> block device.

Hello Jack,

That approach may work well for your employer, but sorry, I don't think this is sufficient for an upstream driver. I think that most users who configure a network storage target expect full control over which storage devices are exported and also over which clients do and do not have access.

Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
>>> Hi Bart,
>>>
>>> Another 2 cents from me :)
>>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>>> o Simple configuration of IBNBD:
>>>>>    - Server side is completely passive: volumes do not need to be
>>>>>      explicitly exported.
>>>>
>>>> That sounds like a security hole? I think the ability to configure
>>>> whether or not an initiator is allowed to log in is essential and also
>>>> which volumes an initiator has access to.
>>>
>>> Our design targets a well-controlled production environment, so security
>>> is handled in another layer.
>>
>> What will happen to a new adopter of the code you are contributing?
>
> Hi Sagi, Hi Bart,
> thanks for your feedback. We considered the "storage cluster" setup, where
> each ibnbd client has access to each ibnbd server. Each ibnbd server
> manages devices under its "dev_search_path" and can provide access to them
> to any ibnbd client in the network.

I don't understand how that helps?

> On top of that the ibnbd server has an additional "artificial" restriction,
> that a device can be mapped in writable mode by only one client at a time.

I think one would still need the option to disallow readable export as well.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
>> Hi Bart,
>>
>> Another 2 cents from me :)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>> o Simple configuration of IBNBD:
>>>>    - Server side is completely passive: volumes do not need to be
>>>>      explicitly exported.
>>>
>>> That sounds like a security hole? I think the ability to configure
>>> whether or not an initiator is allowed to log in is essential and also
>>> which volumes an initiator has access to.
>>
>> Our design targets a well-controlled production environment, so security
>> is handled in another layer.
>
> What will happen to a new adopter of the code you are contributing?

Hi Sagi, Hi Bart,

thanks for your feedback. We considered the "storage cluster" setup, where each ibnbd client has access to each ibnbd server. Each ibnbd server manages devices under its "dev_search_path" and can provide access to them to any ibnbd client in the network. On top of that the ibnbd server has an additional "artificial" restriction, that a device can be mapped in writable mode by only one client at a time.

--
Danil
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Roman and the team (again), replying to my own email :)

I forgot to mention, first of all, thank you for upstreaming your work! I fully support your goal of getting your production driver upstream to minimize your maintenance efforts. I hope that my feedback didn't come across with a different impression; that was certainly not my intent. It would be great if you could address and/or reply to my feedback (as well as the others') and re-spin it again.

Cheers,
Sagi.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Roman and the team,

On 02/02/2018 04:08 PM, Roman Pen wrote:
> This series introduces IBNBD/IBTRS modules.
>
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connection between client and server
> machines via RDMA.

So it's not strictly InfiniBand, correct?

> It is optimized to transfer (read/write) IO blocks in the sense that it
> follows the BIO semantics of providing the possibility to either write
> data from a scatter-gather list to the remote side or to request ("read")
> data transfer from the remote side into a given set of buffers.
>
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality.

A couple of questions on your multipath implementation:
1. What was your main objective over dm-multipath?
2. What was the consideration of this implementation over creating a
stand-alone bio-based device node to reinject the bio to the original
block device?

> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
>
> Why?
>
> - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>   thus internal protocol is simple and consists of several request
>   types only without awareness of underlying hardware devices.

Can you explain how the protocol is developed for thin provisioning? What is the essence of how it is suited for it?

> - IBTRS was developed as an independent RDMA transport library, which
>   supports fail-over and load-balancing policies using multipath, thus
>   it can be used for any other IO needs rather than only for block
>   device.

What do you mean by "any other IO"?

> - IBNBD/IBTRS is faster than NVME over RDMA.
> Old comparison results:
> https://www.spinics.net/lists/linux-rdma/msg48799.html
> (I retested on latest 4.14 kernel - there is no significant difference,
> thus I post the old link).

That is interesting to learn. Reading your reference brings up a couple of questions though:

- It's unclear to me how ibnbd performs reads without performing memory
registration. Is it using the global dma rkey?
- It's unclear to me how there is a difference in noreg in writes, because
for small writes nvme-rdma never registers memory (it uses inline data).
- Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, which seems
considerably low against other reports. Can you try to explain what the
bottleneck was? This can be a potential bug and I (and the rest of the
community) am interested in knowing more details.
- The srp/scst comparison is really not fair, having it in legacy request
mode. Can you please repeat it and report a bug to either linux-rdma or
the scst mailing list?
- Your latency measurements are surprisingly high for a null target device
(even for a low end nvme device actually) regardless of the transport
implementation.

For example:
- QD=1 read latency is 648.95 for ibnbd (I assume usecs, right?) which is
fairly high. On nvme-rdma it's 1058 us, which means over 1 millisecond,
and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1 latency
I got ~14 us. So something does not add up here. If this is not some
configuration issue, then we have serious bugs to handle..
- At QD=16 the read latencies are > 10 ms for null devices?! I'm having
trouble understanding how you were able to get such high latencies
(> 100 ms for QD>=100).

Can you share more information about your setup? It would really help us understand more.

> - Major parts of the code were rewritten, simplified and overall code
>   size was reduced by a quarter.

That is good to know.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
> Hi Bart,
>
> Another 2 cents from me :)
> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>> o Simple configuration of IBNBD:
>>>    - Server side is completely passive: volumes do not need to be
>>>      explicitly exported.
>>
>> That sounds like a security hole? I think the ability to configure whether
>> or not an initiator is allowed to log in is essential and also which
>> volumes an initiator has access to.
>
> Our design targets a well-controlled production environment, so security is
> handled in another layer.

What will happen to a new adopter of the code you are contributing?
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Bart,

Another 2 cents from me :)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> o Simple configuration of IBNBD:
>>    - Server side is completely passive: volumes do not need to be
>>      explicitly exported.
>
> That sounds like a security hole? I think the ability to configure whether or
> not an initiator is allowed to log in is essential and also which volumes an
> initiator has access to.

Our design targets a well-controlled production environment, so security is handled in another layer. On the server side, the admin can set the dev_search_path module parameter to set the parent directory; this is concatenated with the path the client sends in the open message to open a block device.

>> - Only IB port GID and device path needed on client side to map
>>   a block device.
>
> I think IP addressing is preferred over GID addressing in RoCE networks.
> Additionally, have you noticed that GUID configuration support has been added
> to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
> namely that at least in IB networks the prefix will change if the subnet
> manager is reconfigured. Additionally, in IB networks it may happen that the
> target driver is loaded and configured before the GID has been assigned to
> all RDMA ports.
>
> Thanks,
>
> Bart.

Sorry, the above description is not accurate: IBNBD/IBTRS supports GID/IPv4/IPv6 addressing. We will adjust this in the next post.

Thanks,
--
Jack Wang
Linux Kernel Developer
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Fri, Feb 2, 2018 at 5:40 PM, Doug Ledford wrote: > On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote: >> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote: >> > Since the first version the following was changed: >> > >> >- Load-balancing and IO fail-over using multipath features were added. >> >- Major parts of the code were rewritten, simplified and overall code >> > size was reduced by a quarter. >> >> That is interesting to know, but what happened to the feedback that Sagi and >> I provided on v1? Has that feedback been addressed? See also >> https://www.spinics.net/lists/linux-rdma/msg47819.html and >> https://www.spinics.net/lists/linux-rdma/msg47879.html. >> >> Regarding multipath support: there are already two multipath implementations >> upstream (dm-mpath and the multipath implementation in the NVMe initiator). >> I'm not sure we want a third multipath implementation in the Linux kernel. > > There's more than that. There was also md-multipath, and smc-r includes > another version of multipath, plus I assume we support mptcp as well. > > But, to be fair, the different multipaths in this list serve different > purposes and I'm not sure they could all be generalized out and served > by a single multipath code. Although, fortunately, md-multipath is > deprecated, so no need to worry about it, and it is only dm-multipath > and nvme multipath that deal directly with block devices and assume > block semantics. If I read the cover letter right (and I haven't dug > into the code to confirm this), the ibtrs multipath has much more in > common with smc-r multipath, where it doesn't really assume a block > layer device sits on top of it, it's more of a pure network multipath, > which the implementation of smc-r is and mptcp would be too. 
> I would like to see a core RDMA multipath implementation soon that would
> abstract out some of these multipath tasks, at least across RDMA links,
> and that didn't have the current limitations (smc-r only supports RoCE
> links, and it sounds like ibtrs only supports IB like links, but maybe
> I'm wrong there, I haven't looked at the patches yet).

Hi Doug, hi Bart,

Thanks for your valuable input. Here are my 2 cents: IBTRS multipath is indeed a network multipath, with a sysfs interface to add/remove paths dynamically. IBTRS is built on rdma-cm, so it is expected to support RoCE and iWARP as well, but we mainly tested in an IB environment and only lightly tested on RXE.

Regards,
--
Jack Wang
Linux Kernel Developer
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote: > o Simple configuration of IBNBD: >- Server side is completely passive: volumes do not need to be > explicitly exported. That sounds like a security hole? I think the ability to configure whether or not an initiator is allowed to log in is essential and also which volumes an initiator has access to. >- Only IB port GID and device path needed on client side to map > a block device. I think IP addressing is preferred over GID addressing in RoCE networks. Additionally, have you noticed that GUID configuration support has been added to the upstream ib_srpt driver? Using GIDs has a very important disadvantage, namely that at least in IB networks the prefix will change if the subnet manager is reconfigured. Additionally, in IB networks it may happen that the target driver is loaded and configured before the GID has been assigned to all RDMA ports. Thanks, Bart.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote: > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote: > > Since the first version the following was changed: > > > >- Load-balancing and IO fail-over using multipath features were added. > >- Major parts of the code were rewritten, simplified and overall code > > size was reduced by a quarter. > > That is interesting to know, but what happened to the feedback that Sagi and > I provided on v1? Has that feedback been addressed? See also > https://www.spinics.net/lists/linux-rdma/msg47819.html and > https://www.spinics.net/lists/linux-rdma/msg47879.html. > > Regarding multipath support: there are already two multipath implementations > upstream (dm-mpath and the multipath implementation in the NVMe initiator). > I'm not sure we want a third multipath implementation in the Linux kernel. There's more than that. There was also md-multipath, and smc-r includes another version of multipath, plus I assume we support mptcp as well. But, to be fair, the different multipaths in this list serve different purposes and I'm not sure they could all be generalized out and served by a single multipath code. Although, fortunately, md-multipath is deprecated, so no need to worry about it, and it is only dm-multipath and nvme multipath that deal directly with block devices and assume block semantics. If I read the cover letter right (and I haven't dug into the code to confirm this), the ibtrs multipath has much more in common with smc-r multipath, where it doesn't really assume a block layer device sits on top of it, it's more of a pure network multipath, which the implementation of smc-r is and mptcp would be too. 
I would like to see a core RDMA multipath implementation soon that would abstract out some of these multipath tasks, at least across RDMA links, and that didn't have the current limitations (smc-r only supports RoCE links, and it sounds like ibtrs only supports IB like links, but maybe I'm wrong there, I haven't looked at the patches yet).

--
Doug Ledford
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote: > Since the first version the following was changed: > >- Load-balancing and IO fail-over using multipath features were added. >- Major parts of the code were rewritten, simplified and overall code > size was reduced by a quarter. That is interesting to know, but what happened to the feedback that Sagi and I provided on v1? Has that feedback been addressed? See also https://www.spinics.net/lists/linux-rdma/msg47819.html and https://www.spinics.net/lists/linux-rdma/msg47879.html. Regarding multipath support: there are already two multipath implementations upstream (dm-mpath and the multipath implementation in the NVMe initiator). I'm not sure we want a third multipath implementation in the Linux kernel. Thanks, Bart.