Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-09 Thread Ka-Cheong Poon

On 07/06/2018 11:14 PM, Sowmini Varadhan wrote:


- In net/rds/rds.h;

   /* The following ports, 16385, 18634, 18635, are registered with IANA as
* the ports to be used for RDS over TCP and UDP.  18634 is the historical

   What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP.
   IN fact RDS has nothing to do with UDP. The comment is confused. See next
   item below, where the comment disappears.



As mentioned in the above comments, they are registered with IANA.


16385 is registered for  RDS-TCP.



Yes, and it is noted in the comments.  The above comment is stating
the fact so that folks know what is already done when checking the
code.



what does it mean to run rds-tcp and rds_rdma with the CM at the same time?
Is this possible?  (if it is, why not poach some other number like 2049
from NFS?!)



RDS can run over TCP and RDMA at the same time just like other
Internet services running over different transports.



18635 is actually not used in the RDS code.
Why can you not use that for RDS_CM_PORT?



Just like any other Internet services, as long as a number is free,
it can use that number and register it with IANA.  There is no special
meaning about the number itself.  It is only a number to use and need
to be agreed on.  And the normal practice is that the same number be
used for a service running over different Internet transports.  This
is a good practice to avoid confusion.



In general, please do NOT pull up these definitions into net/rds/rds.h
They may change in the future- and the really belong in the transport
specific header file - you question this further below, see my answer
there.



If the number is changed, it will break compatibility.  Just like
any other Internet services, the registered service port number
won't change unless under very special circumstances.



There is no RDS/UDP in Linux but the port numbers are registered
nevertheless.  And RDS can work over UDP AFAICT.  BTW, it is really


No it cannot. RDS needs a reliable transport.  If you start using
UDP (or pigeon-carrier) for the transport you have to build the
reliability yourself.



As you noted above, if one build the reliability on top of UDP, one
can run RDS over UDP.  Is there something which is preventing RDS
to be run over other Internet transports?



The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP
case they do their own congestion management in userspace, and
history has shown that it is hard to get that right, thus the desire to
switch to rds-tcp).



So RDS can in fact run over UDP and there is already an implementation.
Is there any problem with the comment then?



Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP",
that is probably the most harmless one to use for this case, because
- it is unused,
- it calls itself rds-Infiniband,
- the likelihood of doing rds-udp is infinitesmally small (it is more likely
   that we will need to send packets from abc::1 -> fe80:2 before we need
   rds-udp :-) :-) :-))
- and if rds-udp happens, we can use 16385/udp for rds-udp

so please use 18635 for your RDS_CM_PORT and move it to IB specific
header files and lets converge this thread.



I don't understand the rationale.  What is the problem of using 16385?
The above points do not show any problem in using this number.



Towing RDS_TCP_PORT around has absolutely nothing to do with your
ipv6 patch set.



Could you please elaborate this?  For example, is it a problem of
using port 80 for a HTPP web server running over SCTP?  I don't
believe there is.  It is a normal practice in Internet services.



strange that RDS/TCP does not use the port number already registered.
Anyway, the above comments are just a note on this fact in the IANA


16385 was in defacto use for a long time (since 2.6 kernels), thus I
registered it as well when I started fixing this up.



So this is a fix for a previous issue.  This is exactly the point
I made above.  The service port number won't change.  I think this
answers the "they may change in the future" point above.



the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp.
They are available for your use.



As there is really no problem is using 16385 and Oracle Linux is
already using it, I cannot agree on changing it.  It will just
cause inter-operability problem without any reason.



database.  The following is a link to the IANA assignment.


yes, I am aware


IMHO, there should only be RDS_PORT in the first place.  And it
can be used for all transports.  For example, if RDS/SCTP is added,
it can also use the same port number.  There is no need to define


if/when we add rds/sctp, we shall keep that in mind.



This is the point of the comment.  I'm glad that it is noted.



This is getting to be "arguing for the sake of arguing". I dont have
the time for that.


a new port number for each transport.  What is the rationale that
each transport needs to have its own number and be defined in its
own 

Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 22:36), Ka-Cheong Poon wrote:
> This patch series does not change existing behavior.  But
> I think this is a strange RDS semantics as it differs from
> other types of socket.  But this is not about IPv6 support
> and can be dealt with later.

sure, 

> >   Since we are choosing to treat this behavior as a feature we
> >   need to be consistent in rds_getname(). If we find
> >   (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound,
> >   the returned address (address size, address family) should be based
> >   on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a
> >   bind(), and try to do a getsockname(), I get back a sockaddr_in?!)
> 
> 
> OK, will change that.
> >- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
> >   As DaveM has already pointed out, we should be using sa_family to figure
> >   out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
> >   len first, and then the family- that way wont work if I pass in
> >   sockaddr_storage).  E.g., see inet_dgram_connect.
> >
> > if (addr_len < sizeof(uaddr->sa_family))
> > return -EINVAL;
> OK, will change that.

thank you

> >- In net/rds/rds.h;
> >
> >   /* The following ports, 16385, 18634, 18635, are registered with IANA as
> >* the ports to be used for RDS over TCP and UDP.  18634 is the historical
> >
> >   What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP.
> >   IN fact RDS has nothing to do with UDP. The comment is confused. See next
> >   item below, where the comment disappears.
> 
> 
> As mentioned in the above comments, they are registered with IANA.

16385 is registered for  RDS-TCP. 

what does it mean to run rds-tcp and rds_rdma with the CM at the same time? 
Is this possible?  (if it is, why not poach some other number like 2049
from NFS?!)

18635 is actually not used in the RDS code.
Why can you not use that for RDS_CM_PORT? 

In general, please do NOT pull up these definitions into net/rds/rds.h
They may change in the future- and the really belong in the transport
specific header file - you question this further below, see my answer
there.

> There is no RDS/UDP in Linux but the port numbers are registered
> nevertheless.  And RDS can work over UDP AFAICT.  BTW, it is really

No it cannot. RDS needs a reliable transport.  If you start using
UDP (or pigeon-carrier) for the transport you have to build the
reliability yourself.

The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP
case they do their own congestion management in userspace, and 
history has shown that it is hard to get that right, thus the desire to
switch to rds-tcp).

Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP",
that is probably the most harmless one to use for this case, because
- it is unused, 
- it calls itself rds-Infiniband,  
- the likelihood of doing rds-udp is infinitesmally small (it is more likely
  that we will need to send packets from abc::1 -> fe80:2 before we need
  rds-udp :-) :-) :-))
- and if rds-udp happens, we can use 16385/udp for rds-udp

so please use 18635 for your RDS_CM_PORT and move it to IB specific
header files and lets converge this thread.

Towing RDS_TCP_PORT around has absolutely nothing to do with your
ipv6 patch set.

> strange that RDS/TCP does not use the port number already registered.
> Anyway, the above comments are just a note on this fact in the IANA

16385 was in defacto use for a long time (since 2.6 kernels), thus I
registered it as well when I started fixing this up. 

the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp.
They are available for your use.

> database.  The following is a link to the IANA assignment.

yes, I am aware

> IMHO, there should only be RDS_PORT in the first place.  And it
> can be used for all transports.  For example, if RDS/SCTP is added,
> it can also use the same port number.  There is no need to define

if/when we add rds/sctp, we shall keep that in mind. 

This is getting to be "arguing for the sake of arguing". I dont have
the time for that.

> a new port number for each transport.  What is the rationale that
> each transport needs to have its own number and be defined in its
> own header file?

Some transports may not even need a port number. Some may need
several. Some may use sysctl (suggested by Tom Herbert) to make
this configurable. Until recently, we had rds/iWARP, that may need
its own transport hooks, it does not make sense to peanut-butter
that into the core module.

That is why it has to be in the transport. I hope that answers the
question.

> If the behavior of a software component is modified/enhanced such
> that the existing API of this component has problems dealing with
> this new behavior, should the API be modified or a new API be added
> to handle that?  If neither is done, shouldn't this be considered
> a bug?

whatever. The design (parallelization for perf) is fine. Some
parts of 

Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-06 Thread Ka-Cheong Poon

On 07/06/2018 01:44 AM, Sowmini Varadhan wrote:


- rds_getname(): one of the existing properties of PF_RDS is that a
   connect() does not involve an implicit bind(). Note that you are basing
   the changes in rds_bind() on this behavior, thus I guess we make the
   choice of treating this as a feature, not a bug.



This patch series does not change existing behavior.  But
I think this is a strange RDS semantics as it differs from
other types of socket.  But this is not about IPv6 support
and can be dealt with later.



   Since we are choosing to treat this behavior as a feature we
   need to be consistent in rds_getname(). If we find
   (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound,
   the returned address (address size, address family) should be based
   on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a
   bind(), and try to do a getsockname(), I get back a sockaddr_in?!)



OK, will change that.



- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
   As DaveM has already pointed out, we should be using sa_family to figure
   out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
   len first, and then the family- that way wont work if I pass in
   sockaddr_storage).  E.g., see inet_dgram_connect.

 if (addr_len < sizeof(uaddr->sa_family))
 return -EINVAL;



OK, will change that.



- In net/rds/rds.h;

   /* The following ports, 16385, 18634, 18635, are registered with IANA as
* the ports to be used for RDS over TCP and UDP.  18634 is the historical

   What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP.
   IN fact RDS has nothing to do with UDP. The comment is confused. See next
   item below, where the comment disappears.



As mentioned in the above comments, they are registered with IANA.
There is no RDS/UDP in Linux but the port numbers are registered
nevertheless.  And RDS can work over UDP AFAICT.  BTW, it is really
strange that RDS/TCP does not use the port number already registered.
Anyway, the above comments are just a note on this fact in the IANA
database.  The following is a link to the IANA assignment.

https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt



- Also in net/rds/rds.h
   Please dont define transport specific parameters like RD_CM_PORT in the
   common rds.h header. It is unfortunate that we already have RDS_PORT there,
   and we should try to clean that up as well. NOte that RDS_TCP_PORT
   is now in the correct transport-module-specific header (net/rds/tcp.h)
   and its unclean to drag it from there, into the common header as you are
   doing.

   In fact I just tried to move the RDS_PORT definition into
   net/rds/rdma_transport.h and it built just-fine. So to summarize,
   please do the following:
   1. move RDS_PORT into rdma_transport.h
   2. add RDS_CM_PORT into rdma_transport.h
   3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h
  to net/rds/rds.h



IMHO, there should only be RDS_PORT in the first place.  And it
can be used for all transports.  For example, if RDS/SCTP is added,
it can also use the same port number.  There is no need to define
a new port number for each transport.  What is the rationale that
each transport needs to have its own number and be defined in its
own header file?



- net/rds/connection.c
   As we have discussed offline before, the fact that we cannot report
   TCP seq# etc info via the existing rds-info API is not "a bug in the
   design of MPRDS" but rather a lacking in the API design. Moreover,
   much of the useful information around the TCP socket is already
   available via procfs, TCP_INFO etc, so the info by rds-info is rarely
   used for rds-tcp- the more useful information is around the RDS socket
   itself.  So there is a bug in the comment, would be nice if you removed it.



If the behavior of a software component is modified/enhanced such
that the existing API of this component has problems dealing with
this new behavior, should the API be modified or a new API be added
to handle that?  If neither is done, shouldn't this be considered
a bug?



   Also, while you are there, s/exisiting/existing, please.



OK, with change that.



General comments:
-
I remain unconvinced by your global <-> link-local arguments.

For UDP sockets we can do this:

  eth0
host1 -- host2
  abc::1/64   fe80::2
  add abc::/64 as onlink subnet route


   host1# traceroute6 -i eth0 -s abc::1 fe80::2

You just broke this for RDS and are using polemic to defend your case,
but the main thrust of your diatribe seems to be "why would you need
this?" I'll try to address that briefly here.



It is unclear why you need to use words like "polemic"
and "diatribe".  What I wrote is in public and folks can
judge whether they are 

Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-05 Thread Sowmini Varadhan


Some additional comments on this patchset (consolidated here, 
please tease this apart into patch1/patch2/patch3 as appropriate)

I looked at the most of rds-core, and the rds-tcp changes.
Please make sure santosh looks at these carefully, especially.
- RDS bind key  changes
- connection.c
- all the rds_rdma related changes (e.g., the ib* and rdma* files)

- rds_getname(): one of the existing properties of PF_RDS is that a
  connect() does not involve an implicit bind(). Note that you are basing
  the changes in rds_bind() on this behavior, thus I guess we make the
  choice of treating this as a feature, not a bug.

  Since we are choosing to treat this behavior as a feature we
  need to be consistent in rds_getname(). If we find
  (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, 
  the returned address (address size, address family) should be based
  on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a 
  bind(), and try to do a getsockname(), I get back a sockaddr_in?!)

- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
  As DaveM has already pointed out, we should be using sa_family to figure
  out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
  len first, and then the family- that way wont work if I pass in
  sockaddr_storage).  E.g., see inet_dgram_connect.

if (addr_len < sizeof(uaddr->sa_family))
return -EINVAL;

- In net/rds/rds.h;

  /* The following ports, 16385, 18634, 18635, are registered with IANA as
   * the ports to be used for RDS over TCP and UDP.  18634 is the historical

  What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. 
  IN fact RDS has nothing to do with UDP. The comment is confused. See next
  item below, where the comment disappears.
  
- Also in net/rds/rds.h
  Please dont define transport specific parameters like RD_CM_PORT in the 
  common rds.h header. It is unfortunate that we already have RDS_PORT there,
  and we should try to clean that up as well. NOte that RDS_TCP_PORT 
  is now in the correct transport-module-specific header (net/rds/tcp.h)
  and its unclean to drag it from there, into the common header as you are 
  doing.

  In fact I just tried to move the RDS_PORT definition into 
  net/rds/rdma_transport.h and it built just-fine. So to summarize,
  please do the following:
  1. move RDS_PORT into rdma_transport.h
  2. add RDS_CM_PORT into rdma_transport.h
  3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h
 to net/rds/rds.h

- net/rds/connection.c
  As we have discussed offline before, the fact that we cannot report
  TCP seq# etc info via the existing rds-info API is not "a bug in the
  design of MPRDS" but rather a lacking in the API design. Moreover,
  much of the useful information around the TCP socket is already
  available via procfs, TCP_INFO etc, so the info by rds-info is rarely
  used for rds-tcp- the more useful information is around the RDS socket
  itself.  So there is a bug in the comment, would be nice if you removed it.

  Also, while you are there, s/exisiting/existing, please.

General comments:
-
I remain unconvinced by your global <-> link-local arguments.

For UDP sockets we can do this:

 eth0
   host1 -- host2
 abc::1/64   fe80::2
 add abc::/64 as onlink subnet route


  host1# traceroute6 -i eth0 -s abc::1 fe80::2

You just broke this for RDS and are using polemic to defend your case,
but the main thrust of your diatribe seems to be "why would you need 
this?" I'll try to address that briefly here. 

- There may be lot of valid reasons why host2 does not want to be 
  configured with a global prefix. e.g., I only want host2 to be able 
  to talk to onlink hosts.

- RDS mandatorily requires sockets to be bound. So the normal src addr
  selection (that would have caused host1 to use a link-local to talk
  to host2) is suppressed in this case 

  This is exactly the same as a UDP socket bound to abc::1

  Note well, that one of the use-cases for RDS-TCP is to replace 
  existing infra that uses UDP for cluster-IPC. This has come up before
  on netdev:

  See 
https://www.mail-archive.com/search?l=netdev@vger.kernel.org=subject:%22Re%5C%3A+%5C%5BPATCH+net%5C-next+0%5C%2F6%5C%5D+kcm%5C%3A+Kernel+Connection+Multiplexor+%5C%28KCM%5C%29%22=newest=1

  so feature parity with udp is just as important as feature-parity
  for rds_rdma. 

  I hope that helps you see why we need to not break this gratuituously
  for rds-tcp.

BTW, etiquette is to cc folks who have offered review comments on the
code. Please make sure to cc me in follow-ups to this thread.

Thank you.

--Sowmini


[PATCH v2 net-next 0/3] rds: IPv6 support

2018-06-27 Thread Ka-Cheong Poon
This patch set adds IPv6 support to the kernel RDS and related
modules.  Existing RDS apps using IPv4 address continue to run without
any problem.  New RDS apps which want to use IPv6 address can do so by
passing the address in struct sockaddr_in6 to bind(), connect() or
sendmsg().  And those apps also need to use the new IPv6 equivalents
of some of the existing socket options as the existing options use a
32 bit integer to store IP address.

All RDS code now use struct in6_addr to store IP address.  IPv4
address is stored as an IPv4 mapped address.

Header file changes

There are many data structures (RDS socket options) used by RDS apps
which use a 32 bit integer to store IP address. To support IPv6,
struct in6_addr needs to be used. To ensure backward compatibility, a
new data structure is introduced for each of those data structures
which use a 32 bit integer to represent an IP address. And new socket
options are introduced to use those new structures. This means that
existing apps should work without a problem with the new RDS module.
For apps which want to use IPv6, those new data structures and socket
options can be used. IPv4 mapped address is used to represent IPv4
address in the new data structures.

Internally, all RDS data structures which contain an IP address are
changed to use struct in6_addr to store the address. IPv4 address is
stored as an IPv4 mapped address. All the functions which take an IP
address as argument are also changed to use struct in6_addr.

RDS/RDMA/IB uses a private data (struct rds_ib_connect_private)
exchange between endpoints at RDS connection establishment time to
support RDMA. This private data exchange uses a 32 bit integer to
represent an IP address. This needs to be changed in order to support
IPv6. A new private data struct rds6_ib_connect_private is introduced
to handle this. To ensure backward compatibility, an IPv6 capable RDS
stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6
connection. And it continues to use the original RDS_PORT for IPv4 RDS
connections. When it needs to communicate with an IPv6 peer, it uses
the RDS_TCP_PORT to send the connection set up request.

RDS/TCP changes

TCP related code is changed to support IPv6.  Note that only an IPv6
TCP listener on port RDS_TCP_PORT is created as it can accept both
IPv4 and IPv6 connection requests.

IB/RDMA changes

The initial private data exchange between IB endpoints using RDMA is
changed to support IPv6 address instead, if the peer address is IPv6.
To ensure backward compatibility, annother RDMA listener port
(RDS_CM_PORT) is used to accept IPv6 connection. An IPv6 capable RDS
module continues to use the original RDS_PORT for IPv4 RDS
connections. When it needs to communicate with an IPv6 peer, it uses
the RDS_CM_PORT to send the connection set up request.

Ka-Cheong Poon (3):
  rds: Changing IP address internal representation to struct in6_addr
  rds: Enable RDS IPv6 support
  rds: Extend RDS API for IPv6 support

 include/uapi/linux/rds.h |  71 ++-
 net/rds/af_rds.c | 164 --
 net/rds/bind.c   | 119 ++-
 net/rds/cong.c   |  23 ++--
 net/rds/connection.c | 249 +--
 net/rds/ib.c | 114 --
 net/rds/ib.h |  45 +--
 net/rds/ib_cm.c  | 301 +++
 net/rds/ib_mr.h  |   2 +
 net/rds/ib_rdma.c|  24 ++--
 net/rds/ib_recv.c|  18 +--
 net/rds/ib_send.c|  10 +-
 net/rds/loop.c   |   7 +-
 net/rds/rdma.c   |   6 +-
 net/rds/rdma_transport.c |  84 ++---
 net/rds/rdma_transport.h |   2 +
 net/rds/rds.h|  85 -
 net/rds/recv.c   |  76 +---
 net/rds/send.c   |  95 ---
 net/rds/tcp.c| 128 
 net/rds/tcp.h|   4 +-
 net/rds/tcp_connect.c|  68 ---
 net/rds/tcp_listen.c |  58 ++---
 net/rds/tcp_recv.c   |   9 +-
 net/rds/tcp_send.c   |   4 +-
 net/rds/threads.c|  69 +--
 net/rds/transport.c  |  15 ++-
 27 files changed, 1426 insertions(+), 424 deletions(-)

-- 
1.8.3.1