Re: iscsi ifaces / multipathing / etc
Mike Christie wrote:
> I started to do a pseudo patch to see how it would work

With the DHCP thing at hand possibly changing the initiator IP address, I think what we want to do for iser is use NIC names for ifaces, as I do now for iscsi/tcp, with the user taking care of NIC name persistence across reboots through udev et al. I hope we can implement a loop which goes over the IP addresses associated with this NIC and finds one of them which is in the subnet of the target portal we're connecting to. This would be the source IP fed into the enhanced ep_connect exported by iser in the kernel.

>> # ip r s
>> 10.10.1.91 dev ib1 scope link
>> 10.10.0.91 dev ib0 scope link
>> 10.10.0.0/16 dev ib0 proto kernel scope link src 10.10.5.157
>> 10.10.0.0/16 dev ib1 proto kernel scope link src 10.10.5.158
>
> With this do you get 2 paths then? For iface binding we want to try and get 4. So if you got really unlucky and ib0's port and the target portal connected to ib1 died, then you could still have two usable paths.

yes, I have two paths this way, which serves me well at this point. Four paths may be problematic in case the system is a small scale one, e.g. wired with two switches, each connected to all initiators and targets, with 1-2 inter-switch-links (ISLs). Two of the paths pass over an ISL and should be prioritized lower than the two which don't. Is there a way to do that?

Or.

-- 
You received this message because you are subscribed to the Google Groups open-iscsi group.
To post to this group, send email to open-is...@googlegroups.com.
To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
Re: iscsi ifaces / multipathing / etc
Mike Christie wrote:
> Ok let me clarify one other thing too. With the network layer we get a ethX, or whatever you set up your udev rules to use, for each port. That is what I am saying above represents the device name and port. The iscsi drivers that interact with the linux net layer like bnx2i, cxgb3i and iscsi_tcp support using the netdevice name (same name you see in /sys/class/net or ifconfig) for iscsi iface binding.

yep, understood.

> We are talking about the iscsi iface.net_ifacename param right? It is the name you see in sysfs or in ifconfig. You set it to iface.net_ifacename and it gets passed directly down to the BINDTODEVICE setsockopt.

sure, understood as well.

> For drivers that do not have a netdevice name because they do not interact with the network layer (they have an entire net stack in firmware) we use the MAC. So this includes qla4xxx and be2iscsi, and iscsi_tcp also supports this. In that case we loop over the net interfaces and match the MAC with a ethX and then pass that to the BINDTODEVICE setsockopt.

understood. As I have explained, the MAC is problematic for iser for two reasons: under bonding it will not function well (we use the fail_over_mac option of bonding), and also some of the MAC isn't persistent today, so we would have to enhance the iscsi code to allow using only a portion of it.

Or.
Re: iscsi ifaces / multipathing / etc
Mike Christie wrote:
> Maybe that is where I am missing your point. One of the reasons for this is to tell the kernel to ignore the routing rules and use the port we want. So I was saying if there is not an API to do this then just add one like someone did for tcp and the socket code.

A bind-to-device like interface doesn't exist in the rdma-cm api, which is based now only on IP. However, as I wrote, once you know a source IP, it is possible to bypass the routing rules by calling rdma_bind on this IP or by providing it as the source address in rdma_resolve_addr. With this at hand, I don't think I want to add a new interface to the rdma-cm now.

>> I understand using IP is a bit problematic when DHCP is used for the initiator, let me think about that a bit, thanks
>
> How about this, We can add a:
>
>     ISCSI_UEVENT_TRANSPORT_EP_CONNECT_THROUGH_SRC = UEVENT_BASE + 20
>
> And for that we can pass the src addr after the dest addr like how you wanted

yes, basically, I think this is the way / route (...) I want to go through. The down side is that this touches three pieces of the code (user-space, kernel transport, and iser), but I don't think we have much (any) other choice.

> For userspace, I think it is wrong to make the user interface based on the kernel API. If we used something like the name you see in ifconfig (ibX is the default for most drivers I think) then users can set the iface.net_ifacename to it. iscsid can just loop over the interfaces and match ibX with the ip and pass that down. The loop would be pretty much the same as in get_hwaddress_from_netdev() where we loop over the interfaces but it would do

If we are going to provide a source IP to the kernel, then why go through the code that loops on interfaces? Is this because today the semantics of iface.ipaddress is "set this address on the net device associated with iscsi offload", such that you don't want to create confusion/complexity? If this is the case, and you want the iser users to specify a netdevice name such that user space translates it to a source IP, then the code below has to be somewhat more involved:

> transport.c:
>
>     iser_transport {
>         .ep_connect = iser_ep_connect;
>     };
>
> somewhere.c:
>
>     getifaddrs()
>     for (ifa = ifap; ifa; ifa = ifa->ifa_next) {
>         iser_ep_connect()
>         if (!strcmp(ifa->ifa_name, match_name))
>             ktransport_ep_connect( ifa->ifa_addr);

here this comparison isn't sufficient, as someone might put a few IPs on the net-device and we want to find the one of them which is in the subnet of the target portal (sorry...)

Or.

> netlink.c
>
>     ktransport_ep_connect( , struct sockaddr *src_addr)
>     {
>         if (src_addr)
>             ev->type = ISCSI_UEVENT_TRANSPORT_EP_CONNECT_THROUGH_SRC
>         else if ...
>         else ...
>         memcpy(setparam_buf + sizeof(*ev), dst_addr, addrlen)
>         if (src_addr)
>             memcpy(setparam_buf + currlen, src_addr, src_addrlen)
>     }
>
> For the userspace interface for the user we can do pretty much anything.
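The extra step Or asks for (matching on the netdev name alone is not enough, because the address must also sit in the target portal's subnet) could look roughly like the sketch below. It is IPv4-only, and both same_subnet_v4 and pick_src_for_portal are names invented for this illustration, not functions from open-iscsi.

```c
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Non-zero if addr and target share a subnet under mask (all in network byte order). */
int same_subnet_v4(uint32_t addr, uint32_t mask, uint32_t target)
{
    return (addr & mask) == (target & mask);
}

/*
 * Hypothetical helper: among the IPv4 addresses configured on netdev
 * `ifname`, find one whose subnet contains the target portal address.
 * Returns 0 and fills *src on success, -1 if nothing matches or on error.
 */
int pick_src_for_portal(const char *ifname, const struct in_addr *target,
                        struct in_addr *src)
{
    struct ifaddrs *ifap, *ifa;
    int rc = -1;

    if (getifaddrs(&ifap) != 0)
        return -1;

    for (ifa = ifap; ifa; ifa = ifa->ifa_next) {
        const struct sockaddr_in *a, *m;

        /* Entries without an address or netmask cannot be matched. */
        if (!ifa->ifa_addr || !ifa->ifa_netmask)
            continue;
        if (ifa->ifa_addr->sa_family != AF_INET)
            continue;
        if (strcmp(ifa->ifa_name, ifname) != 0)
            continue;

        a = (const struct sockaddr_in *)ifa->ifa_addr;
        m = (const struct sockaddr_in *)ifa->ifa_netmask;
        if (same_subnet_v4(a->sin_addr.s_addr, m->sin_addr.s_addr,
                           target->s_addr)) {
            *src = a->sin_addr;
            rc = 0;
            break;
        }
    }

    freeifaddrs(ifap);
    return rc;
}
```

A netdev carrying several addresses shows up as several entries in the getifaddrs() list, which is why the subnet test sits inside the loop rather than after the first name match.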
Re: iscsi ifaces / multipathing / etc
Mike Christie micha...@cs.wisc.edu wrote:
> If the device name and port do not change normally that seems better to me since it works like the other drivers

just to be on the same page, when you say the other drivers (= transports) this doesn't include tcp, correct? For tcp, what's supported is the --net-- device name, such that a BINDTODEVICE setsockopt is done to bind the socket to this netdevice.

As for the IB device name and port number, indeed they don't change normally, but OTOH the rdma-cm api which iser is using is IP based and resolves IB local/remote addresses (= GUIDs which relate to devices/ports) from IP addresses/routing rules and not the other way around.

I understand using IP is a bit problematic when DHCP is used for the initiator, let me think about that a bit, thanks

Or.
Re: iscsi ifaces / multipathing / etc
On 03/02/2010 02:42 PM, Or Gerlitz wrote:
> Mike Christie micha...@cs.wisc.edu wrote:
>> If the device name and port do not change normally that seems better to me since it works like the other drivers
>
> just to be on the same page, when you say the other drivers (=transports) this doesn't include tcp, correct? for the tcp what's

Ok let me clarify one other thing too. With the network layer we get a ethX, or whatever you set up your udev rules to use, for each port. That is what I am saying above represents the device name and port. The iscsi drivers that interact with the linux net layer like bnx2i, cxgb3i and iscsi_tcp support using the netdevice name (same name you see in /sys/class/net or ifconfig) for iscsi iface binding.

> supported in the --net-- device name such that a BINDTODEVICE setsockopt is done to bind the socket to this netdevice.

We are talking about the iscsi iface.net_ifacename param right? It is the name you see in sysfs or in ifconfig. You set it to iface.net_ifacename and it gets passed directly down to the BINDTODEVICE setsockopt.

For drivers that do not have a netdevice name because they do not interact with the network layer (they have an entire net stack in firmware) we use the MAC. So this includes qla4xxx and be2iscsi, and iscsi_tcp also supports this. In that case we loop over the net interfaces and match the MAC with a ethX and then pass that to the BINDTODEVICE setsockopt.

> As for the IB device name and port number, indeed they don't change normally, but OTOH the rdma-cm api which iser is using is IP based and resolves IB local/remote addresses (= GUIDs which relate to devices/ports) from IP addresses/routing rules and not the other way around.
>
> I understand using IP is a bit problematic when DHCP is used for the initiator, let me think about that a bit, thanks
>
> Or.
Re: iscsi ifaces / multipathing / etc
On 03/02/2010 02:42 PM, Or Gerlitz wrote:
> As for the IB device name and port number, indeed they don't change normally, but OTOH the rdma-cm api which iser is using is IP based and resolves IB local/remote addresses (= GUIDs which relate to devices/ports) from IP addresses/routing rules and not the other way around.

Maybe that is where I am missing your point. One of the reasons for this is to tell the kernel to ignore the routing rules and use the port we want. So I was saying if there is not an API to do this then just add one like someone did for tcp and the socket code. Assuming the interface you want to use does what we want then let's use it and move on.

> I understand using IP is a bit problematic when DHCP is used for the initiator, let me think about that a bit, thanks

How about this. We can add a:

    ISCSI_UEVENT_TRANSPORT_EP_CONNECT_THROUGH_SRC = UEVENT_BASE + 20

And for that we can pass the src addr after the dest addr like how you wanted.

For userspace, I think it is wrong to make the user interface based on the kernel API. If we used something like the name you see in ifconfig (ibX is the default for most drivers I think) then users can set the iface.net_ifacename to it. iscsid can just loop over the interfaces and match ibX with the ip and pass that down. The loop would be pretty much the same as in get_hwaddress_from_netdev() where we loop over the interfaces but it would do

transport.c:

    iser_transport {
        .ep_connect = iser_ep_connect;
    };

somewhere.c:

    getifaddrs()
    for (ifa = ifap; ifa; ifa = ifa->ifa_next) {
        iser_ep_connect()
        if (!strcmp(ifa->ifa_name, match_name))
            ktransport_ep_connect( ifa->ifa_addr);

netlink.c

    ktransport_ep_connect( , struct sockaddr *src_addr)
    {
        if (src_addr)
            ev->type = ISCSI_UEVENT_TRANSPORT_EP_CONNECT_THROUGH_SRC
        else if ...
        else ...
        memcpy(setparam_buf + sizeof(*ev), dst_addr, addrlen)
        if (src_addr)
            memcpy(setparam_buf + currlen, src_addr, src_addrlen)
    }

For the userspace interface for the user we can do pretty much anything.
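As a self-contained illustration of the buffer layout in the netlink.c pseudocode (event header, then destination address, then an optional source address), here is a small sketch. Everything prefixed SKETCH_ or sketch_ is a placeholder invented for this example: the enum values only mirror the "UEVENT_BASE + 20" shape of the proposal, the one-field header stands in for the real event structure, and none of it is the actual iscsi_if.h definition.

```c
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Placeholder event numbers for illustration only (assumed, not from the kernel headers). */
enum {
    SKETCH_UEVENT_BASE            = 10,
    SKETCH_EP_CONNECT             = SKETCH_UEVENT_BASE + 12,
    SKETCH_EP_CONNECT_THROUGH_SRC = SKETCH_UEVENT_BASE + 20
};

/* Stand-in for the real event header; only the type field matters here. */
struct sketch_ev_hdr {
    unsigned int type;
};

/*
 * Lay out: header | dst sockaddr | optional src sockaddr.
 * Returns the number of bytes written, or 0 on bad arguments or
 * if buf is too small.
 */
size_t pack_ep_connect(unsigned char *buf, size_t buflen,
                       const struct sockaddr *dst, size_t dst_len,
                       const struct sockaddr *src, size_t src_len)
{
    struct sketch_ev_hdr ev;
    size_t need = sizeof(ev) + dst_len + (src ? src_len : 0);

    if (!buf || !dst || need > buflen)
        return 0;

    /* The event type tells the receiver whether a source address follows. */
    ev.type = src ? SKETCH_EP_CONNECT_THROUGH_SRC : SKETCH_EP_CONNECT;

    memcpy(buf, &ev, sizeof(ev));
    memcpy(buf + sizeof(ev), dst, dst_len);
    if (src)
        memcpy(buf + sizeof(ev) + dst_len, src, src_len);

    return need;
}
```

Keeping the destination at the same offset in both cases means a receiver that only knows the plain connect event would still find the destination address where it expects it.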
Re: iscsi ifaces / multipathing / etc
On 03/01/2010 02:46 AM, Or Gerlitz wrote:
> Mike Christie wrote:
>> Ah never mind. For some reason I thought you had to have a mask, but if you give rdma_resolve_addr a addr then it will do the right thing and use only the port you wanted right?
>
> YES. Providing rdma_resolve_addr a source address is like calling rdma_bind with this source address and then calling rdma_resolve_addr with only the destination address. So it's like bind(2) in that respect.
>
> Currently the rdma stack, through the include/rdma/rdma_cm.h rdma-cm api, doesn't support things like SO_BINDTODEVICE to either a network or an rdma device. But even if it would/will, I prefer to stay with IP addresses.

I am not sure I got why you prefer to use the IP? Was your reason that part you wrote about iser being IP based?

> I understand how SO_BINDTODEVICE is used for the tcp transport, but it's all done in user space, and later when the connection is bound to the end-point (-- socket created/bound/connected from user space) things are moved to the kernel. This isn't the case with iser. I believe that at this point we agree that there should be a way to specify the source address bound by the user to the iscsi interface to the kernel iser transport code, correct?

Whether it is done in the kernel or userspace, or if you support SO_BINDTODEVICE or not, is not an issue. We can change userspace so you get a device like tcp and offload and/or we can change the kernel in any sane way so you can bind by whatever. You do not have to support SO_BINDTODEVICE. You can work like how the offload drivers do.

I am not sure where you get that I agree ip address is best. All I am saying above is I think I see the API you wanted to use. If the device name and port do not change normally that seems better to me since it works like the other drivers.
Re: iscsi ifaces / multipathing / etc
On 03/01/2010 06:29 PM, Mike Christie wrote:
> If the device name and port do not change normally that seems better to me since it works like the other drivers.

Oh yeah, just to be clear, I am saying I prefer above, but that is based on what I understand today. As I said I did not understand why you think IP based is best for iser when all other drivers use the other option. Beat your point into my head if I am not getting it :) I am open to change.
Re: iscsi ifaces / multipathing / etc
On 02/25/2010 11:24 PM, Mike Christie wrote:
> Do you mean you need a src IP for rdma_resolve_addr? If so, if you have a src ip, can you override what rdma_resolve_addr gives you for cases like where you have two ports on the same subnet?

Ah never mind. For some reason I thought you had to have a mask, but if you give rdma_resolve_addr a addr then it will do the right thing and use only the port you wanted right?
Re: iscsi ifaces / multipathing / etc
Mike Christie wrote:
> I am fine with either. The netdev name (ethX) also has the same problems where udev can change it. It is there for aliases or vlans where we cannot use hwaddress since multiple netdevs have the same MAC.

yes, correct, both netdevice name and hwaddress suffer from what you describe, shit happens.

> Do you have similar issues with ib device names?

without HW changes no; if someone plugs in another card of the same type between two boots then yes.

> What about bonding? Can you use bonding/trunking with iser? Is that going to cause troubles?

yes, indeed, we support bonding, but unlike in ethernet where the bond MAC address is typically that of the current/active slave device, over ipoib bonding automatically sets the fail_over_mac option to active and the bond HW address is the one of the active slave, which means we don't want an iscsi iser interface to be associated to a hwaddress in that case. So we are left with net device names or ip addresses. I don't want to go with hw device names since the whole addressing framework of iser is the same as the one used for TCP, i.e. based on IP addresses.

I prefer the source IP address, or at least the source ip address and mask, through which I can get a source ip on this subnet even if, because of DHCP, the source ip has changed between reboots.

thinking on this a bit more, DHCP related changes between reboots can be quite destructive to iscsi/nfs etc since the target IP can change. If this is the case and one really wants to avoid that, one needs to use host names for the portal and not ip addresses, isn't it? this looks to me like going a bit too far.

All in all, when a non default interface is needed/used (e.g. for multipathing), I am quite sure we need to have some sort of source ip for iser in ep_connect. Please let me know what you think would be the easy/best or close to either of (...) way to have that.

Or.
Re: iscsi ifaces / multipathing / etc
On 02/25/2010 06:03 AM, Or Gerlitz wrote:
> Mike Christie wrote:
>> I am fine with either. The netdev name (ethX) also has the same problems where udev can change it. It is there for aliases or vlans where we cannot use hwaddress since multiple netdevs have the same MAC.
>
> yes, correct, both netdevice name and hwaddress suffer from what you describe, shit happens.
>
>> Do you have similar issues with ib device names?
>
> without HW changes no; if someone plugs in another card of the same type between two boots then yes.
>
>> What about bonding? Can you use bonding/trunking with iser? Is that going to cause troubles?
>
> yes, indeed, we support bonding, but unlike in ethernet where the bond MAC address is typically that of the current/active slave device, over ipoib bonding automatically sets the fail_over_mac option to active and the bond HW address is the one of the active slave, which means we don't want an iscsi iser interface to be associated to a hwaddress in that case. So we are left with net device names or ip addresses. I don't want to go with hw device names since the whole addressing framework of iser is the same as the one used for TCP, i.e. based on IP addresses.
>
> I prefer the source IP address, or at least the source ip address and mask, through which I can get a source ip on this subnet even if, because of DHCP, the source ip has changed between reboots.
>
> thinking on this a bit more, DHCP related changes between reboots can be quite destructive to iscsi/nfs etc since the target IP can change. If this is the case and one really wants to avoid that, one needs to use host names for the portal and not ip addresses, isn't it? this looks to me like going a bit too far.

I think on the target side you normally use static ips (I think you normally have a lot fewer target portals than initiators). People also actually do use hostnames for the target portals too. There have been a couple bug reports on the list on how open-iscsi does not support them correctly.

> All in all, when a non default interface is needed/used (e.g. for multipathing), I am quite sure we need to have some sort of source ip for iser in ep_connect. Please let me know what you think would be the easy/best or close to either of (...) way to have that.

Do you mean you need a src IP for rdma_resolve_addr? If so, if you have a src ip, can you override what rdma_resolve_addr gives you for cases like where you have two ports on the same subnet?
Re: iscsi ifaces / multipathing / etc
On 02/25/2010 11:24 PM, Mike Christie wrote:
>> All in all, when non default interface is needed/used (e.g for multipathing), I am quite sure we need to have some sort of source ip for iser in ep_connect, please let me know what you think would be the easy/best or close to either of (...) way to have that.
>
> Do you mean you need a src IP for rdma_resolve_addr? If so, if you have a src ip, can you override what rdma_resolve_addr gives you for cases like where you have two ports on the same subnet?

Oh yeah, if you are wondering, for iscsi_tcp iface binding there is a sockopt to tell the network layer to forget what it would do (use the route table) normally, and instead send IO through the netdev we pass it. So in userspace we use the netdev we have from iface.net_ifacename or match the MAC with a netdev and then pass that to the sockopt. Do you have something similar for ib?
Re: iscsi ifaces / multipathing / etc
Mike Christie wrote:
> So is there anything in there that is static and can be used to identify the port?
>
> With srp you can bind a target to a rnic port, right (from srp_add_port it looks like the add_target file is on the srp classes rnic port object)? In userspace for the setup then how do you make sure that across boots the target is accessed through the same port? Is the ib_device->name and the port persistent?

Yes, the struct ib_device->name and the ports of this device are persistent. The NIC device name may change because of udev, hot-plugs etc. As for making sure that across boots we go through the same physical source point for the path, you probably have a point here, which we can solve for iser if we go with the hwaddress as the identifier of the iscsi interface. Thoughts?

Or.
Re: iscsi ifaces / multipathing / etc
On 02/24/2010 10:04 AM, Or Gerlitz wrote:
> Mike Christie wrote:
>> So is there anything in there that is static and can be used to identify the port?
>>
>> With srp you can bind a target to a rnic port, right (from srp_add_port it looks like the add_target file is on the srp classes rnic port object)? In userspace for the setup then how do you make sure that across boots the target is accessed through the same port? Is the ib_device->name and the port persistent?
>
> Yes, the struct ib_device->name and the ports of this device are persistent. The NIC device name may change because of udev, hot-plugs etc. As for making sure that across boots we go through the same physical source point for the path, you probably have a point here, which we can solve for iser if we go with the hwaddress as the identifier of the iscsi interface. Thoughts?

I am fine with either. The netdev name (ethX) also has the same problems where udev can change it. It is there for aliases or vlans where we cannot use hwaddress since multiple netdevs have the same MAC.

Do you have similar issues with ib device names?

What about bonding? Can you use bonding/trunking with iser? Is that going to cause troubles?
Re: iscsi ifaces / multipathing / etc
Mike Christie wrote:
>> 2. I wasn't sure if there is and if yes what is the transport role in detecting session failure.
>
> It varies from transport to transport. For iscsi_tcp we do not really have a nice way to figure out if someone just tripped over a cable so that is where the nop comes from. We can tell if the tcp state changes and so you can see iscsi_tcp_state_change notify the upper layers of a problem for that.

understood. Still, the noop-out based watch-dog serves all transports, correct? I'd like to narrow down things and understand if/what is the transport role:

> Some iscsi drivers will run iscsi_conn/session_failure when they discover a link down event or someone doing ifdown. I thought this is sort of what you are able to do with iser_cma_handler->iser_disconnected_handler or with the call to iscsi_conn_failure in iser_handle_comp_error

Yes, we call iscsi_conn/session_failure but I wasn't really sure if multipathing works for non tcp transports if they never make these calls or they have to.

> If there are other places you can detect a link failure type of problem you would want to call iscsi_conn_failure, so the iscsi layer can begin trying to recover the connection and let dm-multipath know there is a problem.

I understand that once there's a timeout on the noop out watch-dog, the iscsi layer will call ep_disconnect, correct? Currently our ep_disconnect is sometimes too slow and I can change that. But, still I wasn't sure whether, for iscsi to let dm-multipath know that there is a failure, something is needed at the transport side or not...

>> I do see that there's an shost param to ep_connect, is there a way it can give me a hint on the source IP?
>
> I do not think it can help iser as it is today. Remember when we talked about a shost per some physical/virtual resource vs a shost per session. This is another place where that came in. bnx2i, cxgb3i and be2iscsi allocate a host per port/netdev, so that is how they know the src they should be using. I will have to think about how to do it for iser as it is today with the host per session

how about extending the ep_connect user/netlink/kernel/iscsi_transport framework to support the functionality provided by the user space code of bind_conn_to_iface or bind_src_by_address? Basically, since the connection establishment framework is IP based, I would be happy to just get some source ip in the kernel when ep_connect is called. I saw the comment on why bind_src_by_address is problematic, but this doesn't apply to IB/iser.

> A question for you. Some people do not like using the netdev name for the binding since it can change between boots. The default method is to use iface.hwaddress instead of iface.net_ifacename. For iscsi it is just the MAC. For iser how big is the RNIC's equivalent of the MAC?

iser is working now over IB and at some point we'll make it work also over iWARP. With IB, the RNIC is an IPoIB NIC whose HW address (equiv of MAC) is 20 bytes long. It turns out that some of these 20 bytes may change... the part which is burned in is called the GUID and is 8 bytes long. Here you see two IPoIB NICs, ib0 and ib1, and the port GUIDs they are using are 00:02:c9:03:00:02:6b:df and 00:02:c9:03:00:02:6b:e0

    7: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
        link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:df
    8: ib1: <BROADCAST,MULTICAST> mtu 2044 qdisc pfifo_fast qlen 256
        link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e0

If you are really interested to learn how these 20 bytes are composed, it's in the form of flags:QPN:GID (1:3:16 bytes) where GID is of the form PREFIX:GUID (8:8 bytes). Do wget http://ietf.org/rfc/rfc4391.txt and see section 9.1.1. Link-Layer Address/Hardware Address. Note that the ifconfig output is buggy so you should use

    $ ip address show

anyway, I wasn't really sure if/how the iface binding by hw address is working in open iscsi. Specifically, I wasn't able to track which library exports net_get_netdev_from_hwaddress ... but I am quite sure this (binding iface to hw address and not netdev) works well for iscsi-tcp and offloads, correct?

Or.
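To make the flags:QPN:GID (1:3:16) layout concrete, the sketch below parses the textual link/infiniband address shown by ip address show and pulls out the persistent 8-byte port GUID from the tail. The function names and the IPOIB_* constants are invented for this illustration and are not taken from open-iscsi or the kernel.

```c
#include <stddef.h>

#define IPOIB_HWADDR_LEN 20   /* flags(1) + QPN(3) + GID(16) */
#define IPOIB_GUID_OFF   12   /* GID = prefix(8) + GUID(8), GID starts at byte 4 */
#define IPOIB_GUID_LEN    8

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "xx:xx:...:xx" with exactly 20 octets. Returns 0 on success, -1 if malformed. */
int parse_ipoib_hwaddr(const char *s, unsigned char out[IPOIB_HWADDR_LEN])
{
    int i;

    for (i = 0; i < IPOIB_HWADDR_LEN; i++) {
        int hi = hexval(s[0]);
        /* Only look at the second character if the first one was valid. */
        int lo = hi < 0 ? -1 : hexval(s[1]);

        if (lo < 0)
            return -1;
        out[i] = (unsigned char)(hi * 16 + lo);
        s += 2;

        if (i < IPOIB_HWADDR_LEN - 1) {
            if (*s != ':')
                return -1;
            s++;
        }
    }
    return *s == '\0' ? 0 : -1;
}

/* The 24-bit QPN sits in bytes 1..3; this is the part that may change. */
unsigned long ipoib_qpn(const unsigned char hw[IPOIB_HWADDR_LEN])
{
    return ((unsigned long)hw[1] << 16) | ((unsigned long)hw[2] << 8) | hw[3];
}

/* The port GUID is the last 8 bytes of the address. */
const unsigned char *ipoib_port_guid(const unsigned char hw[IPOIB_HWADDR_LEN])
{
    return hw + IPOIB_GUID_OFF;
}
```

Matching an iface on only the GUID bytes is one way to read the "use only a portion of it" idea raised elsewhere in the thread: for ib0 and ib1 above, the GUID tails are the same from one listing to the next while the QPN bytes differ per NIC.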
Re: iscsi ifaces / multipathing / etc
On 02/22/2010 09:32 AM, Or Gerlitz wrote:
> Mike Christie wrote:
>>> 2. I wasn't sure if there is and if yes what is the transport role in detecting session failure.
>>
>> It varies from transport to transport. For iscsi_tcp we do not really have a nice way to figure out if someone just tripped over a cable so that is where the nop comes from. We can tell if the tcp state changes and so you can see iscsi_tcp_state_change notify the upper layers of a problem for that.
>
> understood. Still, the noop-out based watch-dog serve all transports, correct?

Yes.

> I'd like to narrow down things and understand if/what is the transport role:

For the nop out path, the transport just has to send/recv the nop pdu/response.

>> Some iscsi drivers will run iscsi_conn/session_failure when they discover a link down event or someone doing ifdown. I thought this is sort of what you are able to do with iser_cma_handler->iser_disconnected_handler or with the call to iscsi_conn_failure in iser_handle_comp_error
>
> Yes, we call iscsi_conn/session_failure but I wasn't really sure if multipathing works for non tcp transports if they never make these calls or they have to.

They do not have to make those calls for multipath to work. Multipath will work better if the transport can signal when there is a problem, because we can stop using a bad path and get IO going to a working path faster. If the transport does nothing then we have to rely on the scsi error handler/timeout to detect the problem and that is very slow.

>> If there are other places you can detect a link failure type of problem you would want to call iscsi_conn_failure, so the iscsi layer can begin trying to recover the connection and let dm-multipath know there is a problem.
>
> I understand that once there's timeout on the noop out watch-dog, the iscsi layer will call ep_disconnect, correct? currently our ep_disconnect is sometime too slow

Yes. You should also change your ep_disconnect because it is not supposed to block (did we talk about this or was this just bnx2i), since it will stop iscsid from processing other events.

> and I can change that. But, still I wasn't sure if for iscsi to let dm-multipath that there is a failure something is needed at the transport side or not...

I do not think there is anything special. It should handle an error like it would if multipath was not used. The user will set the iscsi timers like replacement_timeout and nop timeout differently if they are using multipath.

>>> I do see that there's an shost param to ep_connect, is there a way it can give me a hint on the source IP?
>>
>> I do not think it can help iser as it is today. Remember when we talked about a shost per some physical/virtual resource vs a shost per session. This is another place where that came in. bnx2i, cxgb3i and be2iscsi allocate a host per port/netdev, so that is how they know the src they should be using. I will have to think about how to do it for iser as it is today with the host per session
>
> how about extending the ep_connect user/netlink/kernel/iscsi_transport framework to support the functionality provided by the user space code of bind_conn_to_iface or bind_src_by_address, basically, since the connection establishment framework is IP based, I would happy to just get some source ip in the kernel when ep_connect is called. I saw the comment on why bind_src_by_address is problematic, but this doesn't apply to IB/iser.

Which comment are you talking about? Are you talking about bind() not doing what you would want for iscsi_tcp (target sometimes sends data to the wrong port) or are you talking about if you were to use DHCP and so the IPs could change over boots?

>> A question for you. Some people do not like using the netdev name for the binding since it can change between boots. The default method is to use iface.hwaddress instead of iface.net_ifacename. For iscsi it is just the MAC. For iser how big is the RNIC's equivalent of the MAC?
>
> iser is working now over IB and at some point we'll make it work also over iWARP. With IB, the RNIC is IPoIB NIC whose HW address (equiv of MAC) is 20 bytes long. It turns out that some of these 20 bytes may change... the part which is burned

So is there anything in there that is static and can be used to identify the port?

> is called GUID and is 8 bytes long, here you see two IPoIB NICs, ib0 and ib1 and the port GUIDs they are using are 00:02:c9:03:00:02:6b:df and 00:02:c9:03:00:02:6b:e0
>
>     7: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
>         link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:df
>     8: ib1: <BROADCAST,MULTICAST> mtu 2044 qdisc pfifo_fast qlen 256
>         link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e0
>
> If you really interested to learn how these 20 bytes are composed its in the form of flags:QPN:GID (1:3:16 bytes) where GID is of the form PREFIX:GUID (8:8 bytes) do wget