Re: [PATCH v2 net-next 0/3] rds: IPv6 support
On 07/06/2018 11:14 PM, Sowmini Varadhan wrote: - In net/rds/rds.h; /* The following ports, 16385, 18634, 18635, are registered with IANA as * the ports to be used for RDS over TCP and UDP. 18634 is the historical What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. IN fact RDS has nothing to do with UDP. The comment is confused. See next item below, where the comment disappears. As mentioned in the above comments, they are registered with IANA. 16385 is registered for RDS-TCP. Yes, and it is noted in the comments. The above comment is stating the fact so that folks know what is already done when checking the code. what does it mean to run rds-tcp and rds_rdma with the CM at the same time? Is this possible? (if it is, why not poach some other number like 2049 from NFS?!) RDS can run over TCP and RDMA at the same time just like other Internet services running over different transports. 18635 is actually not used in the RDS code. Why can you not use that for RDS_CM_PORT? Just like any other Internet services, as long as a number is free, it can use that number and register it with IANA. There is no special meaning about the number itself. It is only a number to use and need to be agreed on. And the normal practice is that the same number be used for a service running over different Internet transports. This is a good practice to avoid confusion. In general, please do NOT pull up these definitions into net/rds/rds.h They may change in the future- and the really belong in the transport specific header file - you question this further below, see my answer there. If the number is changed, it will break compatibility. Just like any other Internet services, the registered service port number won't change unless under very special circumstances. There is no RDS/UDP in Linux but the port numbers are registered nevertheless. And RDS can work over UDP AFAICT. BTW, it is really No it cannot. RDS needs a reliable transport. If you start using UDP (or pigeon-carrier) for the transport you have to build the reliability yourself. As you noted above, if one build the reliability on top of UDP, one can run RDS over UDP. Is there something which is preventing RDS to be run over other Internet transports? The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP case they do their own congestion management in userspace, and history has shown that it is hard to get that right, thus the desire to switch to rds-tcp). So RDS can in fact run over UDP and there is already an implementation. Is there any problem with the comment then? Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP", that is probably the most harmless one to use for this case, because - it is unused, - it calls itself rds-Infiniband, - the likelihood of doing rds-udp is infinitesmally small (it is more likely that we will need to send packets from abc::1 -> fe80:2 before we need rds-udp :-) :-) :-)) - and if rds-udp happens, we can use 16385/udp for rds-udp so please use 18635 for your RDS_CM_PORT and move it to IB specific header files and lets converge this thread. I don't understand the rationale. What is the problem of using 16385? The above points do not show any problem in using this number. Towing RDS_TCP_PORT around has absolutely nothing to do with your ipv6 patch set. Could you please elaborate this? For example, is it a problem of using port 80 for a HTPP web server running over SCTP? I don't believe there is. It is a normal practice in Internet services. strange that RDS/TCP does not use the port number already registered. Anyway, the above comments are just a note on this fact in the IANA 16385 was in defacto use for a long time (since 2.6 kernels), thus I registered it as well when I started fixing this up. So this is a fix for a previous issue. This is exactly the point I made above. The service port number won't change. I think this answers the "they may change in the future" point above. the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp. They are available for your use. As there is really no problem is using 16385 and Oracle Linux is already using it, I cannot agree on changing it. It will just cause inter-operability problem without any reason. database. The following is a link to the IANA assignment. yes, I am aware IMHO, there should only be RDS_PORT in the first place. And it can be used for all transports. For example, if RDS/SCTP is added, it can also use the same port number. There is no need to define if/when we add rds/sctp, we shall keep that in mind. This is the point of the comment. I'm glad that it is noted. This is getting to be "arguing for the sake of arguing". I dont have the time for that. a new port number for each transport. What is the rationale that each transport needs to have its own number and be defined in its own
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
On (07/06/18 22:36), Ka-Cheong Poon wrote: > This patch series does not change existing behavior. But > I think this is a strange RDS semantics as it differs from > other types of socket. But this is not about IPv6 support > and can be dealt with later. sure, > > Since we are choosing to treat this behavior as a feature we > > need to be consistent in rds_getname(). If we find > > (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, > > the returned address (address size, address family) should be based > > on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a > > bind(), and try to do a getsockname(), I get back a sockaddr_in?!) > > > OK, will change that. > >- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg > > As DaveM has already pointed out, we should be using sa_family to figure > > out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting > > len first, and then the family- that way wont work if I pass in > > sockaddr_storage). E.g., see inet_dgram_connect. > > > > if (addr_len < sizeof(uaddr->sa_family)) > > return -EINVAL; > OK, will change that. thank you > >- In net/rds/rds.h; > > > > /* The following ports, 16385, 18634, 18635, are registered with IANA as > >* the ports to be used for RDS over TCP and UDP. 18634 is the historical > > > > What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. > > IN fact RDS has nothing to do with UDP. The comment is confused. See next > > item below, where the comment disappears. > > > As mentioned in the above comments, they are registered with IANA. 16385 is registered for RDS-TCP. what does it mean to run rds-tcp and rds_rdma with the CM at the same time? Is this possible? (if it is, why not poach some other number like 2049 from NFS?!) 18635 is actually not used in the RDS code. Why can you not use that for RDS_CM_PORT? In general, please do NOT pull up these definitions into net/rds/rds.h They may change in the future- and the really belong in the transport specific header file - you question this further below, see my answer there. > There is no RDS/UDP in Linux but the port numbers are registered > nevertheless. And RDS can work over UDP AFAICT. BTW, it is really No it cannot. RDS needs a reliable transport. If you start using UDP (or pigeon-carrier) for the transport you have to build the reliability yourself. The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP case they do their own congestion management in userspace, and history has shown that it is hard to get that right, thus the desire to switch to rds-tcp). Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP", that is probably the most harmless one to use for this case, because - it is unused, - it calls itself rds-Infiniband, - the likelihood of doing rds-udp is infinitesmally small (it is more likely that we will need to send packets from abc::1 -> fe80:2 before we need rds-udp :-) :-) :-)) - and if rds-udp happens, we can use 16385/udp for rds-udp so please use 18635 for your RDS_CM_PORT and move it to IB specific header files and lets converge this thread. Towing RDS_TCP_PORT around has absolutely nothing to do with your ipv6 patch set. > strange that RDS/TCP does not use the port number already registered. > Anyway, the above comments are just a note on this fact in the IANA 16385 was in defacto use for a long time (since 2.6 kernels), thus I registered it as well when I started fixing this up. the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp. They are available for your use. > database. The following is a link to the IANA assignment. yes, I am aware > IMHO, there should only be RDS_PORT in the first place. And it > can be used for all transports. For example, if RDS/SCTP is added, > it can also use the same port number. There is no need to define if/when we add rds/sctp, we shall keep that in mind. This is getting to be "arguing for the sake of arguing". I dont have the time for that. > a new port number for each transport. What is the rationale that > each transport needs to have its own number and be defined in its > own header file? Some transports may not even need a port number. Some may need several. Some may use sysctl (suggested by Tom Herbert) to make this configurable. Until recently, we had rds/iWARP, that may need its own transport hooks, it does not make sense to peanut-butter that into the core module. That is why it has to be in the transport. I hope that answers the question. > If the behavior of a software component is modified/enhanced such > that the existing API of this component has problems dealing with > this new behavior, should the API be modified or a new API be added > to handle that? If neither is done, shouldn't this be considered > a bug? whatever. The design (parallelization for perf) is fine. Some parts of
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
On 07/06/2018 01:44 AM, Sowmini Varadhan wrote: - rds_getname(): one of the existing properties of PF_RDS is that a connect() does not involve an implicit bind(). Note that you are basing the changes in rds_bind() on this behavior, thus I guess we make the choice of treating this as a feature, not a bug. This patch series does not change existing behavior. But I think this is a strange RDS semantics as it differs from other types of socket. But this is not about IPv6 support and can be dealt with later. Since we are choosing to treat this behavior as a feature we need to be consistent in rds_getname(). If we find (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, the returned address (address size, address family) should be based on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a bind(), and try to do a getsockname(), I get back a sockaddr_in?!) OK, will change that. - rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg As DaveM has already pointed out, we should be using sa_family to figure out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting len first, and then the family- that way wont work if I pass in sockaddr_storage). E.g., see inet_dgram_connect. if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; OK, will change that. - In net/rds/rds.h; /* The following ports, 16385, 18634, 18635, are registered with IANA as * the ports to be used for RDS over TCP and UDP. 18634 is the historical What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. IN fact RDS has nothing to do with UDP. The comment is confused. See next item below, where the comment disappears. As mentioned in the above comments, they are registered with IANA. There is no RDS/UDP in Linux but the port numbers are registered nevertheless. And RDS can work over UDP AFAICT. BTW, it is really strange that RDS/TCP does not use the port number already registered. Anyway, the above comments are just a note on this fact in the IANA database. The following is a link to the IANA assignment. https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt - Also in net/rds/rds.h Please dont define transport specific parameters like RD_CM_PORT in the common rds.h header. It is unfortunate that we already have RDS_PORT there, and we should try to clean that up as well. NOte that RDS_TCP_PORT is now in the correct transport-module-specific header (net/rds/tcp.h) and its unclean to drag it from there, into the common header as you are doing. In fact I just tried to move the RDS_PORT definition into net/rds/rdma_transport.h and it built just-fine. So to summarize, please do the following: 1. move RDS_PORT into rdma_transport.h 2. add RDS_CM_PORT into rdma_transport.h 3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h to net/rds/rds.h IMHO, there should only be RDS_PORT in the first place. And it can be used for all transports. For example, if RDS/SCTP is added, it can also use the same port number. There is no need to define a new port number for each transport. What is the rationale that each transport needs to have its own number and be defined in its own header file? - net/rds/connection.c As we have discussed offline before, the fact that we cannot report TCP seq# etc info via the existing rds-info API is not "a bug in the design of MPRDS" but rather a lacking in the API design. Moreover, much of the useful information around the TCP socket is already available via procfs, TCP_INFO etc, so the info by rds-info is rarely used for rds-tcp- the more useful information is around the RDS socket itself. So there is a bug in the comment, would be nice if you removed it. If the behavior of a software component is modified/enhanced such that the existing API of this component has problems dealing with this new behavior, should the API be modified or a new API be added to handle that? If neither is done, shouldn't this be considered a bug? Also, while you are there, s/exisiting/existing, please. OK, with change that. General comments: - I remain unconvinced by your global <-> link-local arguments. For UDP sockets we can do this: eth0 host1 -- host2 abc::1/64 fe80::2 add abc::/64 as onlink subnet route host1# traceroute6 -i eth0 -s abc::1 fe80::2 You just broke this for RDS and are using polemic to defend your case, but the main thrust of your diatribe seems to be "why would you need this?" I'll try to address that briefly here. It is unclear why you need to use words like "polemic" and "diatribe". What I wrote is in public and folks can judge whether they are
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
Some additional comments on this patchset (consolidated here, please tease this apart into patch1/patch2/patch3 as appropriate) I looked at the most of rds-core, and the rds-tcp changes. Please make sure santosh looks at these carefully, especially. - RDS bind key changes - connection.c - all the rds_rdma related changes (e.g., the ib* and rdma* files) - rds_getname(): one of the existing properties of PF_RDS is that a connect() does not involve an implicit bind(). Note that you are basing the changes in rds_bind() on this behavior, thus I guess we make the choice of treating this as a feature, not a bug. Since we are choosing to treat this behavior as a feature we need to be consistent in rds_getname(). If we find (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, the returned address (address size, address family) should be based on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a bind(), and try to do a getsockname(), I get back a sockaddr_in?!) - rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg As DaveM has already pointed out, we should be using sa_family to figure out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting len first, and then the family- that way wont work if I pass in sockaddr_storage). E.g., see inet_dgram_connect. if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; - In net/rds/rds.h; /* The following ports, 16385, 18634, 18635, are registered with IANA as * the ports to be used for RDS over TCP and UDP. 18634 is the historical What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. IN fact RDS has nothing to do with UDP. The comment is confused. See next item below, where the comment disappears. - Also in net/rds/rds.h Please dont define transport specific parameters like RD_CM_PORT in the common rds.h header. It is unfortunate that we already have RDS_PORT there, and we should try to clean that up as well. NOte that RDS_TCP_PORT is now in the correct transport-module-specific header (net/rds/tcp.h) and its unclean to drag it from there, into the common header as you are doing. In fact I just tried to move the RDS_PORT definition into net/rds/rdma_transport.h and it built just-fine. So to summarize, please do the following: 1. move RDS_PORT into rdma_transport.h 2. add RDS_CM_PORT into rdma_transport.h 3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h to net/rds/rds.h - net/rds/connection.c As we have discussed offline before, the fact that we cannot report TCP seq# etc info via the existing rds-info API is not "a bug in the design of MPRDS" but rather a lacking in the API design. Moreover, much of the useful information around the TCP socket is already available via procfs, TCP_INFO etc, so the info by rds-info is rarely used for rds-tcp- the more useful information is around the RDS socket itself. So there is a bug in the comment, would be nice if you removed it. Also, while you are there, s/exisiting/existing, please. General comments: - I remain unconvinced by your global <-> link-local arguments. For UDP sockets we can do this: eth0 host1 -- host2 abc::1/64 fe80::2 add abc::/64 as onlink subnet route host1# traceroute6 -i eth0 -s abc::1 fe80::2 You just broke this for RDS and are using polemic to defend your case, but the main thrust of your diatribe seems to be "why would you need this?" I'll try to address that briefly here. - There may be lot of valid reasons why host2 does not want to be configured with a global prefix. e.g., I only want host2 to be able to talk to onlink hosts. - RDS mandatorily requires sockets to be bound. So the normal src addr selection (that would have caused host1 to use a link-local to talk to host2) is suppressed in this case This is exactly the same as a UDP socket bound to abc::1 Note well, that one of the use-cases for RDS-TCP is to replace existing infra that uses UDP for cluster-IPC. This has come up before on netdev: See https://www.mail-archive.com/search?l=netdev@vger.kernel.org=subject:%22Re%5C%3A+%5C%5BPATCH+net%5C-next+0%5C%2F6%5C%5D+kcm%5C%3A+Kernel+Connection+Multiplexor+%5C%28KCM%5C%29%22=newest=1 so feature parity with udp is just as important as feature-parity for rds_rdma. I hope that helps you see why we need to not break this gratuituously for rds-tcp. BTW, etiquette is to cc folks who have offered review comments on the code. Please make sure to cc me in follow-ups to this thread. Thank you. --Sowmini
[PATCH v2 net-next 0/3] rds: IPv6 support
This patch set adds IPv6 support to the kernel RDS and related modules. Existing RDS apps using IPv4 address continue to run without any problem. New RDS apps which want to use IPv6 address can do so by passing the address in struct sockaddr_in6 to bind(), connect() or sendmsg(). And those apps also need to use the new IPv6 equivalents of some of the existing socket options as the existing options use a 32 bit integer to store IP address. All RDS code now use struct in6_addr to store IP address. IPv4 address is stored as an IPv4 mapped address. Header file changes There are many data structures (RDS socket options) used by RDS apps which use a 32 bit integer to store IP address. To support IPv6, struct in6_addr needs to be used. To ensure backward compatibility, a new data structure is introduced for each of those data structures which use a 32 bit integer to represent an IP address. And new socket options are introduced to use those new structures. This means that existing apps should work without a problem with the new RDS module. For apps which want to use IPv6, those new data structures and socket options can be used. IPv4 mapped address is used to represent IPv4 address in the new data structures. Internally, all RDS data structures which contain an IP address are changed to use struct in6_addr to store the address. IPv4 address is stored as an IPv4 mapped address. All the functions which take an IP address as argument are also changed to use struct in6_addr. RDS/RDMA/IB uses a private data (struct rds_ib_connect_private) exchange between endpoints at RDS connection establishment time to support RDMA. This private data exchange uses a 32 bit integer to represent an IP address. This needs to be changed in order to support IPv6. A new private data struct rds6_ib_connect_private is introduced to handle this. To ensure backward compatibility, an IPv6 capable RDS stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6 connection. And it continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_TCP_PORT to send the connection set up request. RDS/TCP changes TCP related code is changed to support IPv6. Note that only an IPv6 TCP listener on port RDS_TCP_PORT is created as it can accept both IPv4 and IPv6 connection requests. IB/RDMA changes The initial private data exchange between IB endpoints using RDMA is changed to support IPv6 address instead, if the peer address is IPv6. To ensure backward compatibility, annother RDMA listener port (RDS_CM_PORT) is used to accept IPv6 connection. An IPv6 capable RDS module continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to send the connection set up request. Ka-Cheong Poon (3): rds: Changing IP address internal representation to struct in6_addr rds: Enable RDS IPv6 support rds: Extend RDS API for IPv6 support include/uapi/linux/rds.h | 71 ++- net/rds/af_rds.c | 164 -- net/rds/bind.c | 119 ++- net/rds/cong.c | 23 ++-- net/rds/connection.c | 249 +-- net/rds/ib.c | 114 -- net/rds/ib.h | 45 +-- net/rds/ib_cm.c | 301 +++ net/rds/ib_mr.h | 2 + net/rds/ib_rdma.c| 24 ++-- net/rds/ib_recv.c| 18 +-- net/rds/ib_send.c| 10 +- net/rds/loop.c | 7 +- net/rds/rdma.c | 6 +- net/rds/rdma_transport.c | 84 ++--- net/rds/rdma_transport.h | 2 + net/rds/rds.h| 85 - net/rds/recv.c | 76 +--- net/rds/send.c | 95 --- net/rds/tcp.c| 128 net/rds/tcp.h| 4 +- net/rds/tcp_connect.c| 68 --- net/rds/tcp_listen.c | 58 ++--- net/rds/tcp_recv.c | 9 +- net/rds/tcp_send.c | 4 +- net/rds/threads.c| 69 +-- net/rds/transport.c | 15 ++- 27 files changed, 1426 insertions(+), 424 deletions(-) -- 1.8.3.1