Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-06-04 Thread Danil Kipnis
Hi Doug,

thanks for the feedback. You read the cover letter correctly: our
transport library implements multipath (load balancing and failover)
on top of the RDMA API. Its name "IBTRS" is slightly misleading in that
regard: it can sit on top of RoCE as well. The library allows for
"bundling" multiple RDMA "paths" (source addr - destination addr pairs)
into one "session". So our session consists of one or more paths, and
each path under the hood consists of as many QPs (each connecting
source with destination) as there are CPUs on the client system. The
user load (in our case IBNBD is a block device and generates block
requests) is load-balanced on a per-CPU basis.
As I understand it, this is something very different from what smc-r is
doing - am I right? Do you know what stage MP-RDMA development is
currently at?
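
(Purely as an illustration of the per-CPU idea above - hypothetical names,
not the actual IBTRS code - the client could pick the connection for a
request roughly like this:)

#include <linux/smp.h>

struct conn;                            /* hypothetical per-QP context */

struct path {
        struct conn **con;              /* one connection per online CPU */
        unsigned int nr_cons;
};

static struct conn *path_get_conn(struct path *p)
{
        /* user load is balanced on a per-CPU basis */
        return p->con[raw_smp_processor_id() % p->nr_cons];
}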

Best,

Danil Kipnis.

P.S. Sorry for any duplicate; the first mail was returned because it was sent as HTML.

On Thu, Feb 8, 2018 at 7:10 PM Bart Van Assche  wrote:
>
> On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote:
> > thanks for the link to the article. To the best of my understanding,
> > the guys suggest to authenticate the devices first and only then
> > authenticate the users who use the devices in order to get access to a
> > corporate service. They also mention in the presentation the current
> > trend of moving corporate services into the cloud. But I think this is
> > not about the devices from which that cloud is build of. Isn't a cloud
> > first build out of devices connected via IB and then users (and their
> > devices) are provided access to the services of that cloud as a whole?
> > If a malicious user already plugged his device into an IB switch of a
> > cloud internal infrastructure, isn't it game over anyway? Can't he
> > just take the hard drives instead of mapping them?
>
> Hello Danil,
>
> It seems like we each have been focussing on different aspects of the article.
> The reason I referred to that article is because I read the following in
> that article: "Unlike the conventional perimeter security model, BeyondCorp
> doesn’t gate access to services and tools based on a user’s physical location
> or the originating network [ ... ] The zero trust architecture spells trouble
> for traditional attacks that rely on penetrating a tough perimeter to waltz
> freely within an open internal network." Suppose e.g. that an organization
> decides to use RoCE or iWARP for connectivity between block storage initiator
> systems and block storage target systems and that it has a single company-
> wide Ethernet network. If the target system does not restrict access based
> on initiator IP address then any penetrator would be able to access all the
> block devices exported by the target after a SoftRoCE or SoftiWARP initiator
> driver has been loaded. If the target system however restricts access based
> on the initiator IP address then that would make it harder for a penetrator
> to access the exported block storage devices. Instead of just penetrating the
> network access, IP address spoofing would have to be used or access would
> have to be obtained to a system that has been granted access to the target
> system.
>
> Thanks,
>
> Bart.
>
>


-- 
Danil Kipnis
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-08 Thread Bart Van Assche
On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote:
> thanks for the link to the article. To the best of my understanding,
> the guys suggest to authenticate the devices first and only then
> authenticate the users who use the devices in order to get access to a
> corporate service. They also mention in the presentation the current
> trend of moving corporate services into the cloud. But I think this is
> not about the devices from which that cloud is build of. Isn't a cloud
> first build out of devices connected via IB and then users (and their
> devices) are provided access to the services of that cloud as a whole?
> If a malicious user already plugged his device into an IB switch of a
> cloud internal infrastructure, isn't it game over anyway? Can't he
> just take the hard drives instead of mapping them?

Hello Danil,

It seems like we each have been focussing on different aspects of the article.
The reason I referred to that article is because I read the following in
that article: "Unlike the conventional perimeter security model, BeyondCorp
doesn’t gate access to services and tools based on a user’s physical location
or the originating network [ ... ] The zero trust architecture spells trouble
for traditional attacks that rely on penetrating a tough perimeter to waltz
freely within an open internal network." Suppose e.g. that an organization
decides to use RoCE or iWARP for connectivity between block storage initiator
systems and block storage target systems and that it has a single company-
wide Ethernet network. If the target system does not restrict access based
on initiator IP address then any penetrator would be able to access all the
block devices exported by the target after a SoftRoCE or SoftiWARP initiator
driver has been loaded. If the target system however restricts access based
on the initiator IP address then that would make it harder for a penetrator
to access the exported block storage devices. Instead of just penetrating the
network access, IP address spoofing would have to be used or access would
have to be obtained to a system that has been granted access to the target
system.

Thanks,

Bart.




Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-08 Thread Danil Kipnis
On Wed, Feb 7, 2018 at 6:32 PM, Bart Van Assche  wrote:
> On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote:
>> So the question is: are there real life setups where
>> some of the local IB network members can be untrusted?
>
> Hello Roman,
>
> You may want to read more about the latest evolutions with regard to network
> security. An article that I can recommend is the following: "Google reveals
> own security regime policy trusts no network, anywhere, ever"
> (https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/).
>
> If data-centers would start deploying RDMA among their entire data centers
> (maybe they are already doing this) then I think they will want to restrict
> access to block devices to only those initiator systems that need it.
>
> Thanks,
>
> Bart.
>
>

Hi Bart,

thanks for the link to the article. To the best of my understanding,
the authors suggest authenticating the devices first and only then
authenticating the users of those devices in order to grant access to a
corporate service. They also mention in the presentation the current
trend of moving corporate services into the cloud. But I think this is
not about the devices that the cloud itself is built of. Isn't a cloud
first built out of devices connected via IB, and aren't users (and their
devices) then granted access to the services of that cloud as a whole?
If a malicious user has already plugged his device into an IB switch of
the cloud's internal infrastructure, isn't it game over anyway? Couldn't
he just take the hard drives instead of mapping them?

Thanks,

Danil.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote:
> So the question is: are there real life setups where
> some of the local IB network members can be untrusted?

Hello Roman,

You may want to read more about the latest evolutions with regard to network
security. An article that I can recommend is the following: "Google reveals
own security regime policy trusts no network, anywhere, ever"
(https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/).

If data-centers would start deploying RDMA among their entire data centers
(maybe they are already doing this) then I think they will want to restrict
access to block devices to only those initiator systems that need it.

Thanks,

Bart.




Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Roman Penyaev
On Wed, Feb 7, 2018 at 5:35 PM, Christopher Lameter  wrote:
> On Mon, 5 Feb 2018, Bart Van Assche wrote:
>
>> That approach may work well for your employer but sorry I don't think this is
>> sufficient for an upstream driver. I think that most users who configure a
>> network storage target expect full control over which storage devices are 
>> exported
>> and also over which clients do have and do not have access.
>
> Well is that actually true for IPoIB? It seems that I can arbitrarily
> attach to any partition I want without access control. In many ways some
> of the RDMA layers and modules are loose with security since performance
> is what matters mostly and deployments occur in separate production
> environments.
>
> We have had security issues (that not fully resolved yet) with the RDMA
> RPC API for years.. So maybe lets relax on the security requirements a
> bit?
>

Frankly speaking, I do not understand the "security" concern with this kind
of block device, and with RDMA in particular.  I admit that I personally may
not see the whole picture, so can someone provide a real use case/scenario?
What we have in our datacenters is a trusted environment (do others exist?).
You need a volume, you create it.  You need to map a volume remotely -
you map it.  Of course there are provisioning checks, rw/ro checks, etc.
But in general, any IP/key checks (is that client really a "good" guy or not?)
are simply useless.  So the question is: are there real-life setups where
some of the local IB network members can be untrusted?

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Christopher Lameter
On Mon, 5 Feb 2018, Bart Van Assche wrote:

> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are 
> exported
> and also over which clients do have and do not have access.

Well, is that actually true for IPoIB? It seems that I can arbitrarily
attach to any partition I want without access control. In many ways some
of the RDMA layers and modules are loose with security, since performance
is what matters most and deployments occur in separate production
environments.

We have had security issues (that are not fully resolved yet) with the RDMA
RPC API for years. So maybe let's relax the security requirements a
bit?



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 13:57 +0100, Roman Penyaev wrote:
> On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche  
> wrote:
> > On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
> > Something else I would like to understand better is how much of the latency
> > gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
> > protocol. Was e.g. support for immediate data present in the NVMeOF and/or
> > SRP drivers used on your test setup?
> 
> I did not get the question. IBTRS uses empty messages with only imm_data
> field set to respond on IO. This is a part of the IBTRS protocol.  I do
> not understand how can immediate data be present in other drivers, if
> those do not use it in their protocols.  I am lost here.

With "immediate data" I was referring to including the entire write buffer
in the write PDU itself. See e.g. the enable_imm_data kernel module parameter
of the ib_srp-backport driver. See also the use of SRP_DATA_DESC_IMM in the
SCST ib_srpt target driver. Neither the upstream SRP initiator nor the upstream
SRP target supports immediate data today. However, sending that code upstream
is on my to-do list.

For the upstream NVMeOF initiator and target drivers, see also the call of
nvme_rdma_map_sg_inline() in nvme_rdma_map_data().
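
To make the "inline"/immediate data idea concrete, here is a simplified
sketch at the verbs level (illustrative only, not the actual nvme-rdma or
SRP code): the payload rides in the same send as the command capsule, using
the device's local_dma_lkey, so the initiator registers no memory and the
target does not have to issue an RDMA READ.

#include <rdma/ib_verbs.h>

static int post_cmd_with_inline_data(struct ib_qp *qp, struct ib_device *dev,
                                     u64 cmd_dma, u32 cmd_len,
                                     u64 data_dma, u32 data_len)
{
        struct ib_sge sge[2] = {
                { .addr = cmd_dma,  .length = cmd_len,  .lkey = dev->local_dma_lkey },
                { .addr = data_dma, .length = data_len, .lkey = dev->local_dma_lkey },
        };
        struct ib_send_wr wr = {
                .opcode     = IB_WR_SEND,
                .sg_list    = sge,
                .num_sge    = 2,
                .send_flags = IB_SEND_SIGNALED,
        };
        struct ib_send_wr *bad_wr;

        /* command and data leave in a single send work request */
        return ib_post_send(qp, &wr, &bad_wr);
}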

> > Are you aware that the NVMeOF target driver calls page_alloc() from the hot 
> > path but that there are plans to
> > avoid these calls in the hot path by using a caching mechanism similar to
> > the SGV cache in SCST? Are you aware that a significant latency reduction
> > can be achieved by changing the SCST SGV cache from a global into a per-CPU
> > cache?
> 
> No, I am not aware. That is nice, that there is a lot of room for performance
> tweaks. I will definitely retest on fresh kernel once everything is done on
> nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys).

Recently the functions sgl_alloc() and sgl_free() were introduced in the
upstream kernel (these will be included in kernel v4.16). The NVMe target
driver, LIO and several other drivers have been modified to use these
functions instead of their own copies of them. The next step is to replace
these calls with calls to functions that perform cached allocations.
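
For reference, a minimal usage sketch of those helpers (illustrative only):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/scatterlist.h>

static int example_sgl_usage(unsigned long long length)
{
        unsigned int nents;
        struct scatterlist *sgl;

        /* allocate an sg-list backed by freshly allocated pages */
        sgl = sgl_alloc(length, GFP_KERNEL, &nents);
        if (!sgl)
                return -ENOMEM;

        /* ... hand sgl/nents over to the block or RDMA code ... */

        sgl_free(sgl);
        return 0;
}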

> > Regarding the SRP measurements: have you tried to set the
> > never_register kernel module parameter to true? I'm asking this because I
> > think that mode is most similar to how the IBNBD initiator driver works.
> 
> yes, according to my notes from that link (frankly, I do not remember,
> but that is what I wrote 1 year ago):
> 
> * Where suffixes mean:
> 
>  _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
>   with 'register_always=N' param
> 
> That what you are asking, right?

Not really. With register_always=Y memory registration is always used by the
SRP initiator, even if the data can be coalesced into a single sg entry. With
register_always=N memory registration is only performed if multiple sg entries
are needed to describe the data. And with never_register=Y memory registration
is not used even if multiple sg entries are needed to describe the data buffer.
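
In other words, roughly (a sketch of the decision these three settings imply,
not the actual ib_srp code):

#include <linux/types.h>

static bool needs_memory_registration(bool register_always,
                                      bool never_register,
                                      unsigned int sg_entries)
{
        if (never_register)
                return false;   /* never register, even for multiple sg entries */
        if (register_always)
                return true;    /* always register, even for a single sg entry */
        return sg_entries > 1;  /* register only if one sg entry is not enough */
}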

Thanks,

Bart.






Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Roman Penyaev
Hi Sagi and all,

On Mon, Feb 5, 2018 at 1:30 PM, Sagi Grimberg  wrote:
> Hi Roman and the team (again), replying to my own email :)
>
> I forgot to mention that first of all thank you for upstreaming
> your work! I fully support your goal to have your production driver
> upstream to minimize your maintenance efforts. I hope that my
> feedback didn't came across with a different impression, that was
> certainly not my intent.

Well, I've just recovered from two heart attacks, which I got
while reading your replies, but now I am fine, thanks :)

> It would be great if you can address and/or reply to my feedback
> (as well as others) and re-spin it again.

Jokes aside, we would like to thank you all for the valuable
feedback. I got a lot of useful remarks from you, Sagi, and from you, Bart.
We will try to address them in the next version and will provide up-to-date
comparison results.

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Roman Penyaev
On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche  wrote:
> On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
>> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg  wrote:
>> > [ ... ]
>> > - srp/scst comparison is really not fair having it in legacy request
>> >   mode. Can you please repeat it and report a bug to either linux-rdma
>> >   or to the scst mailing list?
>>
>> Yep, I can retest with mq.
>>
>> > - Your latency measurements are surprisingly high for a null target
>> >   device (even for low end nvme device actually) regardless of the
>> >   transport implementation.
>>
>> Hm, network configuration?  These are results on machines dedicated
>> to our team for testing in one of our datacenters. Nothing special
>> in configuration.
>

Hello Bart,

> I agree that the latency numbers are way too high for a null target device.
> Last time I measured latency for the SRP protocol against an SCST target
> + null block driver at the target side and ConnectX-3 adapters I measured a
> latency of about 14 microseconds. That's almost 100 times less than the
> measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html.

Here is the following configuration of the setup:

Initiator and target HW configuration:
AMD Opteron 6386 SE, 64 CPUs, 128 GB RAM
InfiniBand: Mellanox Technologies MT26428
[ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]

Also, I remember that between initiator and target there were two IB switches.
Unfortunately, I can't repeat the same configuration, but will retest as
soon as we get new HW.

> Something else I would like to understand better is how much of the latency
> gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
> protocol. Was e.g. support for immediate data present in the NVMeOF and/or
> SRP drivers used on your test setup?

I did not get the question. IBTRS uses empty messages with only the imm_data
field set to respond to IO; this is part of the IBTRS protocol.  I do
not understand how immediate data can be present in other drivers if
they do not use it in their protocols.  I am lost here.

> Are you aware that the NVMeOF target driver calls page_alloc() from the hot 
> path but that there are plans to
> avoid these calls in the hot path by using a caching mechanism similar to
> the SGV cache in SCST? Are you aware that a significant latency reduction
> can be achieved by changing the SCST SGV cache from a global into a per-CPU
> cache?

No, I was not aware. It is nice that there is a lot of room for performance
tweaks. I will definitely retest on a fresh kernel once everything is done in
nvme, scst and ibtrs (especially once we get rid of FMRs and UNSAFE rkeys).
Maybe there are some other parameters which can also be tweaked?

> Regarding the SRP measurements: have you tried to set the
> never_register kernel module parameter to true? I'm asking this because I
> think that mode is most similar to how the IBNBD initiator driver works.

Yes, according to my notes from that link (frankly, I do not remember exactly,
but that is what I wrote a year ago):

* Where suffixes mean:

 _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
  with 'register_always=N' param

Is that what you are asking?

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Bart Van Assche
On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg  wrote:
> > [ ... ]
> > - srp/scst comparison is really not fair having it in legacy request
> >   mode. Can you please repeat it and report a bug to either linux-rdma
> >   or to the scst mailing list?
> 
> Yep, I can retest with mq.
> 
> > - Your latency measurements are surprisingly high for a null target
> >   device (even for low end nvme device actually) regardless of the
> >   transport implementation.
> 
> Hm, network configuration?  These are results on machines dedicated
> to our team for testing in one of our datacenters. Nothing special
> in configuration.

Hello Roman,

I agree that the latency numbers are way too high for a null target device.
Last time I measured latency for the SRP protocol against an SCST target
+ null block driver at the target side and ConnectX-3 adapters I measured a
latency of about 14 microseconds. That's almost 100 times less than the
measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html.

Something else I would like to understand better is how much of the latency
gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
protocol. Was e.g. support for immediate data present in the NVMeOF and/or
SRP drivers used on your test setup? Are you aware that the NVMeOF target
driver calls page_alloc() from the hot path but that there are plans to
avoid these calls in the hot path by using a caching mechanism similar to
the SGV cache in SCST? Are you aware that a significant latency reduction
can be achieved by changing the SCST SGV cache from a global into a per-CPU
cache? Regarding the SRP measurements: have you tried to set the
never_register kernel module parameter to true? I'm asking this because I
think that mode is most similar to how the IBNBD initiator driver works.

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Bart Van Assche
On Tue, 2018-02-06 at 10:44 +0100, Danil Kipnis wrote:
> the configuration (which devices can be accessed by a particular
> client) can happen also after the kernel target module is loaded. The
> directory in  is a module parameter and is fixed. It
> contains for example "/ibnbd_devices/". But a particular client X
> would be able to only access the devices located in the subdirectory
> "/ibnbd_devices/client_x/". (The sessionname here is client_x) One can
> add or remove the devices from that directory (those are just symlinks
> to /dev/xxx) at any time - before or after the server module is
> loaded. But you are right, we need something additional in order to be
> able to specify which devices a client can access writable and which
> readonly. May be another subdirectories "wr" and "ro" for each client:
> those under /ibnbd_devices/client_x/ro/ can only be read by client_x
> and those in /ibnbd_devices/client_x/wr/ can also be written to?

Please use a standard kernel filesystem (sysfs or configfs) instead of
reinventing it.

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Roman Penyaev
Hi Sagi,

On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg  wrote:
> Hi Roman and the team,
>
> On 02/02/2018 04:08 PM, Roman Pen wrote:
>>
>> This series introduces IBNBD/IBTRS modules.
>>
>> IBTRS (InfiniBand Transport) is a reliable high speed transport library
>> which allows for establishing connection between client and server
>> machines via RDMA.
>
>
> So its not strictly infiniband correct?

This is RDMA.  The original IB prefix is a bit confusing, that's true.

>  It is optimized to transfer (read/write) IO blocks
>>
>> in the sense that it follows the BIO semantics of providing the
>> possibility to either write data from a scatter-gather list to the
>> remote side or to request ("read") data transfer from the remote side
>> into a given set of buffers.
>>
>> IBTRS is multipath capable and provides I/O fail-over and load-balancing
>> functionality.
>
>
> Couple of questions on your multipath implementation?
> 1. What was your main objective over dm-multipath?

No objections; mpath is simply a part of the ibtrs transport library.

> 2. What was the consideration of this implementation over
> creating a stand-alone bio based device node to reinject the
> bio to the original block device?

ibnbd and ibtrs are separate; on fail-over or load-balancing
we work with IO requests inside the library itself.

>> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
>> (client and server) that allow for remote access of a block device on
>> the server over IBTRS protocol. After being mapped, the remote block
>> devices can be accessed on the client side as local block devices.
>> Internally IBNBD uses IBTRS as an RDMA transport library.
>>
>> Why?
>>
>> - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>>   thus internal protocol is simple and consists of several request
>>  types only without awareness of underlaying hardware devices.
>
>
> Can you explain how the protocol is developed for thin-p? What are the
> essence of how its suited for it?

Here I wanted to emphasize that we do not support any HW-specific commands,
like nvme does, thus the internal protocol consists of only a few commands.
So, answering your question "how is the protocol developed for thin-p",
I would put it the other way around: "the protocol does nothing to support
real devices, because all we need is to map thin-p volumes".  It is just
simpler.

>> - IBTRS was developed as an independent RDMA transport library, which
>>   supports fail-over and load-balancing policies using multipath, thus
>>  it can be used for any other IO needs rather than only for block
>>  device.
>
>
> What do you mean by "any other IO"?

I mean other IO producers, not only ibnbd, since this is just a transport
library.

>
>> - IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
>>   https://www.spinics.net/lists/linux-rdma/msg48799.html
>>   (I retested on latest 4.14 kernel - there is no any significant
>>   difference, thus I post the old link).
>
>
> That is interesting to learn.
>
> Reading your reference brings a couple of questions though,
> - Its unclear to me how ibnbd performs reads without performing memory
>   registration. Is it using the global dma rkey?

Yes, the global rkey is used.

WRITE: the data is transferred by RDMA writes issued from the client
READ: the data is transferred by RDMA writes issued from the server
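
(For illustration only - not the actual IBTRS code - an RDMA WRITE into a
buffer the peer has advertised together with its global rkey looks roughly
like this; no per-IO memory registration is involved on either side:)

#include <rdma/ib_verbs.h>

static int rdma_write_to_peer(struct ib_qp *qp, struct ib_device *dev,
                              u64 local_dma, u32 len,
                              u64 remote_addr, u32 peer_rkey)
{
        struct ib_sge sge = {
                .addr   = local_dma,
                .length = len,
                .lkey   = dev->local_dma_lkey,
        };
        struct ib_rdma_wr wr = {
                .wr = {
                        .opcode     = IB_WR_RDMA_WRITE,
                        .sg_list    = &sge,
                        .num_sge    = 1,
                        .send_flags = IB_SEND_SIGNALED,
                },
                .remote_addr = remote_addr,     /* advertised by the peer */
                .rkey        = peer_rkey,       /* the peer's global rkey */
        };
        struct ib_send_wr *bad_wr;

        return ib_post_send(qp, &wr.wr, &bad_wr);
}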

> - Its unclear to me how there is a difference in noreg in writes,
>   because for small writes nvme-rdma never register memory (it uses
>   inline data).

We have no support for that (inline data).

> - Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, that
>   seems considerably low against other reports. Can you try and explain
>   what was the bottleneck? This can be a potential bug and I (and the
>   rest of the community is interesting in knowing more details).

Sure, I can try.  BTW, what are the other reports and numbers?

> - srp/scst comparison is really not fair having it in legacy request
>   mode. Can you please repeat it and report a bug to either linux-rdma
>   or to the scst mailing list?

Yep, I can retest with mq.

> - Your latency measurements are surprisingly high for a null target
>   device (even for low end nvme device actually) regardless of the
>   transport implementation.

Hm, network configuration?  These are results on machines dedicated
to our team for testing in one of our datacenters. Nothing special
in configuration.

> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>   fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>   and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>   latency I got ~14 us. So something does not add up here. If this is
>   not some configuration issue, then we have serious bugs to handle..
>
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>   troubles understanding how you were able to get such high latencies
>   (> 100 ms for QD>=100)

What does QD stand for? Queue depth?  This is not a queue depth, this

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Roman Penyaev
On Mon, Feb 5, 2018 at 6:20 PM, Bart Van Assche  wrote:
> On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote:
>> Everything (fio jobs, setup, etc) is given in the same link:
>>
>> https://www.spinics.net/lists/linux-rdma/msg48799.html
>>
>> at the bottom you will find links on google docs with many pages
>> and archived fio jobs and scripts. (I do not remember exactly,
>> one year passed, but there should be everything).
>>
>> Regarding smaller iodepth_batch_submit - that decreases performance.
>> Once I played with that, even introduced new iodepth_batch_complete_max
>> option for fio, but then I decided to stop and simply chose this
>> configuration, which provides me fastest results.
>
> Hello Roman,
>
> That's weird. For which protocols did reducing iodepth_batch_submit lead
> to lower performance: all the tested protocols or only some of them?

Hi Bart,

It seems that it does not depend on the protocol (when I tested, it was true
for both nvme and ibnbd).  It depends on the load.  Under high load (one or a
few fio jobs dedicated to each CPU, and we have 64 CPUs) it turns out to be
faster to wait for the completions of the whole queue of that particular
block device instead of switching from kernel to userspace for each
completed IO.

But I can assure you that the performance difference is very minor; it exists,
but it does not change the whole picture of what you see in that google
sheet. So what I tried to achieve was to squeeze out everything I could,
nothing more.

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Danil Kipnis
On Mon, Feb 5, 2018 at 7:38 PM, Bart Van Assche  wrote:
> On 02/05/18 08:40, Danil Kipnis wrote:
>>
>> It just occurred to me, that we could easily extend the interface in
>> such a way that each client (i.e. each session) would have on server
>> side her own directory with the devices it can access. I.e. instead of
>> just "dev_search_path" per server, any client would be able to only
>> access devices under /session_name. (session name
>> must already be generated by each client in a unique way). This way
>> one could have an explicit control over which devices can be accessed
>> by which clients. Do you think that would do it?
>
>
> Hello Danil,
>
> That sounds interesting to me. However, I think that approach requires to
> configure client access completely before the kernel target side module is
> loaded. It does not allow to configure permissions dynamically after the
> kernel target module has been loaded. Additionally, I don't see how to
> support attributes per (initiator, block device) pair with that approach.
> LIO e.g. supports the
> /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You
> may want to implement similar functionality if you want to convince more
> users to use IBNBD.
>
> Thanks,
>
> Bart.

Hello Bart,

the configuration (which devices can be accessed by a particular
client) can also happen after the kernel target module is loaded. The
directory in dev_search_path is a module parameter and is fixed. It
contains for example "/ibnbd_devices/". But a particular client X
would only be able to access the devices located in the subdirectory
"/ibnbd_devices/client_x/" (the session name here is client_x). One can
add or remove devices from that directory (those are just symlinks
to /dev/xxx) at any time - before or after the server module is
loaded. But you are right, we need something additional in order to be
able to specify which devices a client can access writable and which
read-only. Maybe two more subdirectories, "wr" and "ro", for each client:
those under /ibnbd_devices/client_x/ro/ could only be read by client_x
and those in /ibnbd_devices/client_x/wr/ could also be written to?

Thanks,

Danil.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche

On 02/05/18 08:40, Danil Kipnis wrote:

It just occurred to me that we could easily extend the interface in
such a way that each client (i.e. each session) would have its own
directory on the server side with the devices it can access. I.e. instead of
just one "dev_search_path" per server, any client would only be able to
access devices under dev_search_path/session_name (the session name
must already be generated by each client in a unique way). This way
one could have explicit control over which devices can be accessed
by which clients. Do you think that would do it?


Hello Danil,

That sounds interesting to me. However, I think that approach requires 
to configure client access completely before the kernel target side 
module is loaded. It does not allow to configure permissions dynamically 
after the kernel target module has been loaded. Additionally, I don't 
see how to support attributes per (initiator, block device) pair with 
that approach. LIO e.g. supports the 
/sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. 
You may want to implement similar functionality if you want to convince 
more users to use IBNBD.


Thanks,

Bart.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote:
> Everything (fio jobs, setup, etc) is given in the same link:
> 
> https://www.spinics.net/lists/linux-rdma/msg48799.html
> 
> at the bottom you will find links on google docs with many pages
> and archived fio jobs and scripts. (I do not remember exactly,
> one year passed, but there should be everything).
> 
> Regarding smaller iodepth_batch_submit - that decreases performance.
> Once I played with that, even introduced new iodepth_batch_complete_max
> option for fio, but then I decided to stop and simply chose this
> configuration, which provides me fastest results.

Hello Roman,

That's weird. For which protocols did reducing iodepth_batch_submit lead
to lower performance: all the tested protocols or only some of them?

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Roman Penyaev
Hi Bart,

On Mon, Feb 5, 2018 at 5:58 PM, Bart Van Assche  wrote:
> On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
>> - Your latency measurements are surprisingly high for a null target
>>device (even for low end nvme device actually) regardless of the
>>transport implementation.
>>
>> For example:
>> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>>fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>>and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>>latency I got ~14 us. So something does not add up here. If this is
>>not some configuration issue, then we have serious bugs to handle..
>>
>> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>>troubles understanding how you were able to get such high latencies
>>(> 100 ms for QD>=100)
>>
>> Can you share more information about your setup? It would really help
>> us understand more.
>
> I would also appreciate it if more information could be provided about the
> measurement results. In addition to answering Sagi's questions, would it
> be possible to share the fio job that was used for measuring latency? In
> https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf
> I found the following:
>
> iodepth=128
> iodepth_batch_submit=128
>
> If you want to keep the pipeline full I think that you need to set the
> iodepth_batch_submit parameter to a value that is much lower than iodepth.
> I think that setting iodepth_batch_submit equal to iodepth will yield
> suboptimal IOPS results. Jens, please correct me if I got this wrong.

Sorry, Bart, I will answer here in just a few words (I would like to answer
Sagi's mail in detail tomorrow).

Everything (fio jobs, setup, etc) is given in the same link:

https://www.spinics.net/lists/linux-rdma/msg48799.html

at the bottom you will find links to google docs with many pages
and archived fio jobs and scripts. (I do not remember exactly,
one year has passed, but everything should be there.)

Regarding a smaller iodepth_batch_submit - that decreases performance.
I once played with it, and even introduced the new iodepth_batch_complete_max
option for fio, but then I decided to stop and simply chose this
configuration, which gives me the fastest results.

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
> - Your latency measurements are surprisingly high for a null target
>device (even for low end nvme device actually) regardless of the
>transport implementation.
> 
> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>latency I got ~14 us. So something does not add up here. If this is
>not some configuration issue, then we have serious bugs to handle..
> 
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>troubles understanding how you were able to get such high latencies
>(> 100 ms for QD>=100)
> 
> Can you share more information about your setup? It would really help
> us understand more.

I would also appreciate it if more information could be provided about the
measurement results. In addition to answering Sagi's questions, would it
be possible to share the fio job that was used for measuring latency? In
https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf
I found the following:

iodepth=128
iodepth_batch_submit=128

If you want to keep the pipeline full I think that you need to set the
iodepth_batch_submit parameter to a value that is much lower than iodepth.
I think that setting iodepth_batch_submit equal to iodepth will yield
suboptimal IOPS results. Jens, please correct me if I got this wrong.

Thanks,

Bart.




Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Danil Kipnis
On Mon, Feb 5, 2018 at 3:17 PM, Sagi Grimberg  wrote:
>
 Hi Bart,

 My another 2 cents:)
 On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
 wrote:
>
>
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>
>>
>> o Simple configuration of IBNBD:
>>  - Server side is completely passive: volumes do not need to be
>>explicitly exported.
>
>
>
> That sounds like a security hole? I think the ability to configure
> whether or
> not an initiator is allowed to log in is essential and also which
> volumes
> an
> initiator has access to.


 Our design target for well controlled production environment, so
 security is handle in other layer.
>>>
>>>
>>>
>>> What will happen to a new adopter of the code you are contributing?
>>
>>
>> Hi Sagi, Hi Bart,
>> thanks for your feedback.
>> We considered the "storage cluster" setup, where each ibnbd client has
>> access to each ibnbd server. Each ibnbd server manages devices under
>> his "dev_search_path" and can provide access to them to any ibnbd
>> client in the network.
>
>
> I don't understand how that helps?
>
>> On top of that Ibnbd server has an additional
>> "artificial" restriction, that a device can be mapped in writable-mode
>> by only one client at once.
>
>
> I think one would still need the option to disallow readable export as
> well.

It just occurred to me that we could easily extend the interface in
such a way that each client (i.e. each session) would have its own
directory on the server side with the devices it can access. I.e. instead of
just one "dev_search_path" per server, any client would only be able to
access devices under dev_search_path/session_name (the session name
must already be generated by each client in a unique way). This way
one could have explicit control over which devices can be accessed
by which clients. Do you think that would do it?


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
On Mon, Feb 5, 2018 at 5:16 PM, Bart Van Assche  wrote:
> On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  
>> wrote:
>> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > > o Simple configuration of IBNBD:
>> > >- Server side is completely passive: volumes do not need to be
>> > >  explicitly exported.
>> >
>> > That sounds like a security hole? I think the ability to configure whether 
>> > or
>> > not an initiator is allowed to log in is essential and also which volumes 
>> > an
>> > initiator has access to.
>>
>> Our design target for well controlled production environment, so security is
>> handle in other layer. On server side, admin can set the dev_search_path in
>> module parameter to set parent directory, this will concatenate with the path
>> client send in open message to open a block device.
>
> Hello Jack,
>
> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are 
> exported
> and also over which clients do have and do not have access.
>
> Bart.
Hello Bart,

I agree for general purpose, it may be good to have better access control.

Thanks,
-- 
Jack Wang
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
> Hi Bart,
> 
> My another 2 cents:)
> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  
> wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > > o Simple configuration of IBNBD:
> > >- Server side is completely passive: volumes do not need to be
> > >  explicitly exported.
> > 
> > That sounds like a security hole? I think the ability to configure whether 
> > or
> > not an initiator is allowed to log in is essential and also which volumes an
> > initiator has access to.
> 
> Our design target for well controlled production environment, so security is
> handle in other layer. On server side, admin can set the dev_search_path in
> module parameter to set parent directory, this will concatenate with the path
> client send in open message to open a block device.

Hello Jack,

That approach may work well for your employer but sorry I don't think this is
sufficient for an upstream driver. I think that most users who configure a
network storage target expect full control over which storage devices are 
exported
and also over which clients do have and do not have access.

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg



Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
wrote:


On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:


o Simple configuration of IBNBD:
 - Server side is completely passive: volumes do not need to be
   explicitly exported.



That sounds like a security hole? I think the ability to configure
whether or
not an initiator is allowed to log in is essential and also which volumes
an
initiator has access to.


Our design targets a well-controlled production environment, so
security is handled in another layer.



What will happen to a new adopter of the code you are contributing?


Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages devices under
his "dev_search_path" and can provide access to them to any ibnbd
client in the network.


I don't understand how that helps?


On top of that Ibnbd server has an additional
"artificial" restriction, that a device can be mapped in writable-mode
by only one client at once.


I think one would still need the option to disallow readable export as
well.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Danil Kipnis
>
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
>> wrote:
>>>
>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:

 o Simple configuration of IBNBD:
 - Server side is completely passive: volumes do not need to be
   explicitly exported.
>>>
>>>
>>> That sounds like a security hole? I think the ability to configure
>>> whether or
>>> not an initiator is allowed to log in is essential and also which volumes
>>> an
>>> initiator has access to.
>>
>> Our design target for well controlled production environment, so
>> security is handle in other layer.
>
>
> What will happen to a new adopter of the code you are contributing?

Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages devices under
his "dev_search_path" and can provide access to them to any ibnbd
client in the network. On top of that Ibnbd server has an additional
"artificial" restriction, that a device can be mapped in writable-mode
by only one client at once.

-- 
Danil


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg

Hi Roman and the team (again), replying to my own email :)

I forgot to mention that, first of all, thank you for upstreaming
your work! I fully support your goal of having your production driver
upstream to minimize your maintenance efforts. I hope that my
feedback didn't come across with a different impression; that was
certainly not my intent.

It would be great if you can address and/or reply to my feedback
(as well as others) and re-spin it again.

Cheers,
Sagi.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg

Hi Roman and the team,

On 02/02/2018 04:08 PM, Roman Pen wrote:

This series introduces IBNBD/IBTRS modules.

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connection between client and server
machines via RDMA.


So it's not strictly InfiniBand, correct?

 It is optimized to transfer (read/write) IO blocks

in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality.


A couple of questions on your multipath implementation:
1. What was your main objective over dm-multipath?
2. What was the consideration behind this implementation over
creating a stand-alone bio-based device node to reinject the
bio into the original block device?


IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.

Why?

- IBNBD/IBTRS is developed in order to map thin provisioned volumes,
  thus the internal protocol is simple and consists of only a few request
  types, without awareness of the underlying hardware devices.


Can you explain how the protocol is developed for thin-p? What is the
essence of how it is suited for it?


- IBTRS was developed as an independent RDMA transport library, which
  supports fail-over and load-balancing policies using multipath, thus
 it can be used for any other IO needs rather than only for block
 device.


What do you mean by "any other IO"?


- IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
  https://www.spinics.net/lists/linux-rdma/msg48799.html
  (I retested on the latest 4.14 kernel - there is no significant
  difference, thus I post the old link).


That is interesting to learn.

Reading your reference brings a couple of questions though,
- It's unclear to me how ibnbd performs reads without performing memory
  registration. Is it using the global DMA rkey?

- It's unclear to me how there is a difference in noreg for writes,
  because for small writes nvme-rdma never registers memory (it uses
  inline data).

- It looks like with nvme-rdma you max out your IOPS at 1.6 MIOPS, which
  seems considerably low compared to other reports. Can you try to explain
  what the bottleneck was? This can be a potential bug, and I (and the
  rest of the community) am interested in knowing more details.

- srp/scst comparison is really not fair having it in legacy request
  mode. Can you please repeat it and report a bug to either linux-rdma
  or to the scst mailing list?

- Your latency measurements are surprisingly high for a null target
  device (even for low end nvme device actually) regardless of the
  transport implementation.

For example:
- QD=1 read latency is 648.95 for ibnbd (I assume usecs, right?), which is
  fairly high. On nvme-rdma it's 1058 us, which means over 1 millisecond,
  and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
  latency I got ~14 us. So something does not add up here. If this is
  not some configuration issue, then we have serious bugs to handle...

- At QD=16 the read latencies are > 10ms for null devices?! I'm having
  trouble understanding how you were able to get such high latencies
  (> 100 ms for QD>=100).

Can you share more information about your setup? It would really help
us understand more.


- Major parts of the code were rewritten, simplified and overall code
  size was reduced by a quarter.


That is good to know.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg



Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  wrote:

On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:

o Simple configuration of IBNBD:
- Server side is completely passive: volumes do not need to be
  explicitly exported.


That sounds like a security hole? I think the ability to configure whether or
not an initiator is allowed to log in is essential and also which volumes an
initiator has access to.

Our design targets a well-controlled production environment, so
security is handled in another layer.


What will happen to a new adopter of the code you are contributing?


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> o Simple configuration of IBNBD:
>>- Server side is completely passive: volumes do not need to be
>>  explicitly exported.
>
> That sounds like a security hole? I think the ability to configure whether or
> not an initiator is allowed to log in is essential and also which volumes an
> initiator has access to.
Our design targets a well-controlled production environment, so
security is handled in another layer.
On the server side, the admin can set the dev_search_path module parameter
to set the parent directory; this is concatenated with the path the client
sends in the open message to open a block device.
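
(To illustrate the scheme - a hypothetical helper, not the actual IBNBD
server code - the device path a client may open is always resolved under
the configured parent directory:)

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/limits.h>
#include <linux/string.h>

static char dev_search_path[PATH_MAX] = "/ibnbd_devices";

static int build_dev_path(char *buf, size_t len, const char *client_path)
{
        int n;

        /* refuse obvious attempts to escape the configured parent directory */
        if (strstr(client_path, ".."))
                return -EINVAL;

        n = snprintf(buf, len, "%s/%s", dev_search_path, client_path);
        if (n < 0 || (size_t)n >= len)
                return -ENAMETOOLONG;
        return 0;
}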


>
>>- Only IB port GID and device path needed on client side to map
>>  a block device.
>
> I think IP addressing is preferred over GID addressing in RoCE networks.
> Additionally, have you noticed that GUID configuration support has been added
> to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
> namely that at least in IB networks the prefix will change if the subnet
> manager is reconfigured. Additionally, in IB networks it may happen that the
> target driver is loaded and configured before the GID has been assigned to
> all RDMA ports.
>
> Thanks,
>
> Bart.

Sorry, the above description is not accurate: IBNBD/IBTRS supports
GID/IPv4/IPv6 addressing.
We will adjust this in the next post.

Thanks,
-- 
Jack Wang
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
On Fri, Feb 2, 2018 at 5:40 PM, Doug Ledford  wrote:
> On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > Since the first version the following was changed:
>> >
>> >- Load-balancing and IO fail-over using multipath features were added.
>> >- Major parts of the code were rewritten, simplified and overall code
>> >  size was reduced by a quarter.
>>
>> That is interesting to know, but what happened to the feedback that Sagi and
>> I provided on v1? Has that feedback been addressed? See also
>> https://www.spinics.net/lists/linux-rdma/msg47819.html and
>> https://www.spinics.net/lists/linux-rdma/msg47879.html.
>>
>> Regarding multipath support: there are already two multipath implementations
>> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
>> I'm not sure we want a third multipath implementation in the Linux kernel.
>
> There's more than that.  There was also md-multipath, and smc-r includes
> another version of multipath, plus I assume we support mptcp as well.
>
> But, to be fair, the different multipaths in this list serve different
> purposes and I'm not sure they could all be generalized out and served
> by a single multipath code.  Although, fortunately, md-multipath is
> deprecated, so no need to worry about it, and it is only dm-multipath
> and nvme multipath that deal directly with block devices and assume
> block semantics.  If I read the cover letter right (and I haven't dug
> into the code to confirm this), the ibtrs multipath has much more in
> common with smc-r multipath, where it doesn't really assume a block
> layer device sits on top of it, it's more of a pure network multipath,
> which the implementation of smc-r is and mptcp would be too.  I would
> like to see a core RDMA multipath implementation soon that would
> abstract out some of these multipath tasks, at least across RDMA links,
> and that didn't have the current limitations (smc-r only supports RoCE
> links, and it sounds like ibtrs only supports IB like links, but maybe
> I'm wrong there, I haven't looked at the patches yet).
Hi Doug, hi Bart,

Thanks for your valuable input, here is my 2 cents:

IBTRS multipath is indeed a network multipath, with a sysfs interface to
remove/add paths dynamically.
IBTRS is built on rdma-cm, so it is expected to support RoCE and iWARP as
well, but we have mainly tested in an IB environment and only lightly
tested on RXE.
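
(As a small illustration of why an rdma-cm based transport is not tied to
IB - a sketch only, not the IBTRS code - the client resolves an ordinary
socket address and rdma-cm picks the underlying device, be it IB, RoCE or
iWARP:)

#include <linux/err.h>
#include <net/net_namespace.h>
#include <rdma/rdma_cm.h>

static int start_connect(struct sockaddr *dst,
                         rdma_cm_event_handler handler, void *ctx)
{
        struct rdma_cm_id *id;

        id = rdma_create_id(&init_net, handler, ctx, RDMA_PS_TCP, IB_QPT_RC);
        if (IS_ERR(id))
                return PTR_ERR(id);

        /* route resolution and rdma_connect() are driven from the handler */
        return rdma_resolve_addr(id, NULL, dst, 5000 /* ms */);
}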


Regards,
-- 
Jack Wang
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> o Simple configuration of IBNBD:
>- Server side is completely passive: volumes do not need to be
>  explicitly exported.

That sounds like a security hole? I think the ability to configure whether or
not an initiator is allowed to log in is essential and also which volumes an
initiator has access to.

>- Only IB port GID and device path needed on client side to map
>  a block device.

I think IP addressing is preferred over GID addressing in RoCE networks.
Additionally, have you noticed that GUID configuration support has been added
to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
namely that at least in IB networks the prefix will change if the subnet
manager is reconfigured. Additionally, in IB networks it may happen that the
target driver is loaded and configured before the GID has been assigned to
all RDMA ports.

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Doug Ledford
On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > Since the first version the following was changed:
> > 
> >- Load-balancing and IO fail-over using multipath features were added.
> >- Major parts of the code were rewritten, simplified and overall code
> >  size was reduced by a quarter.
> 
> That is interesting to know, but what happened to the feedback that Sagi and
> I provided on v1? Has that feedback been addressed? See also
> https://www.spinics.net/lists/linux-rdma/msg47819.html and
> https://www.spinics.net/lists/linux-rdma/msg47879.html.
> 
> Regarding multipath support: there are already two multipath implementations
> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
> I'm not sure we want a third multipath implementation in the Linux kernel.

There's more than that.  There was also md-multipath, and smc-r includes
another version of multipath, plus I assume we support mptcp as well.

But, to be fair, the different multipaths in this list serve different
purposes and I'm not sure they could all be generalized out and served
by a single multipath code.  Although, fortunately, md-multipath is
deprecated, so no need to worry about it, and it is only dm-multipath
and nvme multipath that deal directly with block devices and assume
block semantics.  If I read the cover letter right (and I haven't dug
into the code to confirm this), the ibtrs multipath has much more in
common with smc-r multipath, where it doesn't really assume a block
layer device sits on top of it, it's more of a pure network multipath,
which the implementation of smc-r is and mptcp would be too.  I would
like to see a core RDMA multipath implementation soon that would
abstract out some of these multipath tasks, at least across RDMA links,
and that didn't have the current limitations (smc-r only supports RoCE
links, and it sounds like ibtrs only supports IB like links, but maybe
I'm wrong there, I haven't looked at the patches yet).

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> Since the first version the following was changed:
> 
>- Load-balancing and IO fail-over using multipath features were added.
>- Major parts of the code were rewritten, simplified and overall code
>  size was reduced by a quarter.

That is interesting to know, but what happened to the feedback that Sagi and
I provided on v1? Has that feedback been addressed? See also
https://www.spinics.net/lists/linux-rdma/msg47819.html and
https://www.spinics.net/lists/linux-rdma/msg47879.html.

Regarding multipath support: there are already two multipath implementations
upstream (dm-mpath and the multipath implementation in the NVMe initiator).
I'm not sure we want a third multipath implementation in the Linux kernel.

Thanks,

Bart.