On 11/10/2016 05:08 PM, Jon Maloy wrote:
>
>
>> -----Original Message-----
>> From: Parthasarathy Bhuvaragan
>> Sent: Thursday, 10 November, 2016 10:50
>> To: Jon Maloy <[email protected]>;
>> [email protected];
>> Ying Xue <[email protected]>
>> Cc: [email protected]; [email protected]
>> Subject: Re: [PATCH net-next v2 0/4] tipc: introduce multicast through
>> replication
>>
>> On 10/27/2016 04:35 PM, Jon Maloy wrote:
>>> TIPC multicast messages are currently distributed via L2 broadcast
>>> or IP multicast to all nodes in the cluster, irrespective of the
>>> number of real destinations of the message.
>>>
>>> In this series we introduce an option to transport messages via
>>> replication ("replicast") across a selected number of unicast links,
>>> instead of relying on the underlying media. This option is used when
>>> true broadcast/multicast is not supported by the media, or when the
>>> number of true destinations is much smaller than the cluster size.
>>>
>>> v2: - Fixed a counter bug when removing nodes from destination node list
>>>     - Moved definition of node destination list from to bcast.{h,c}
>>>
>>> Jon Maloy (4):
>>>   tipc: add function for checking broadcast support in bearer
>>>   tipc: add functionality to lookup multicast destination nodes
>>>   tipc: introduce replicast as transport option for multicast
>>>   tipc: make replicast a user selectable option
>>>
>>>  include/uapi/linux/tipc.h |   6 +-
>>>  net/tipc/bcast.c          | 245 +++++++++++++++++++++++++++++++++++++++++-----
>>>  net/tipc/bcast.h          |  40 +++++++-
>>>  net/tipc/bearer.c         |  15 ++-
>>>  net/tipc/bearer.h         |   6 ++
>>>  net/tipc/link.c           |  12 ++-
>>>  net/tipc/msg.c            |  17 ++++
>>>  net/tipc/msg.h            |   2 +
>>>  net/tipc/name_table.c     |  33 +++++++
>>>  net/tipc/name_table.h     |   4 +
>>>  net/tipc/node.c           |  27 +++--
>>>  net/tipc/node.h           |   4 +-
>>>  net/tipc/socket.c         |  89 ++++++++++-------
>>>  net/tipc/udp_media.c      |   8 +-
>>>  14 files changed, 424 insertions(+), 84 deletions(-)
>>>
>> [partha]
>> I have a general concern that
>> this design might not work for
>> non-blocking sockets or blocking sockets which set the MSG_DONTWAIT flag.
>>
>> Consider that the user is replicasting to 4 peers.
>> For example, in tipc_rcast_xmit() we manage to xmit to the first two peers
>> successfully but the next peer (3) fails due to link congestion. Since
>> this is a non-blocking call we return EAGAIN to the user.
>> The subsequent retry from the user will re-deliver the same message to the
>> first two peers.
>>
>> The checks for congestion are now based on the limits on unicast
>> links. We will get into the above situation easily, as the traffic
>> pattern is not the same on all links.
>>
>> I think you will have a solution to this as always :-).
>> [/partha]
>
> The solution is already there. This is why I have an "unsent" and a "sent"
> queue in struct tipc_nlist. When a message has been successfully sent to a
> node (or when an error code other than -ELINKCONG is returned) the
> corresponding node item is moved from the "unsent" to the "sent" queue, and
> will be disregarded at the next send attempt. But I now see that I made a
> stupid mistake during the last iteration of this code; I purge the
> destination list before returning to the user, even when returning -EAGAIN.
> I'll fix that.
>
> You may also wonder why I have the two queues in struct tipc_nlist, instead
> of just deleting the items for sent nodes. This is because this list will be
> reused across different sending sessions in later commits.
>
> ///jon
>

[partha]
Jon, the proposal works only for blocking sockets with the default send
timeout (MAX_SCHEDULE_TIMEOUT). When the socket transmits to a peer, that
peer is moved from the unsent to the sent queue, and the socket is put to
sleep if a link is congested. Once the socket receives the wakeup message
(congestion ceases), it continues through the list of unsent peers until
the message has been sent to all of them.
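
To make the unsent/sent bookkeeping discussed here concrete, a minimal
user-space sketch in C follows. It is not the TIPC code: struct toy_nlist,
struct toy_links and toy_rcast_xmit() are hypothetical stand-ins, one flag
per node replaces the two queues, and -EAGAIN stands in for the link
congestion error. It also assumes the destination list survives between
send attempts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 8

/* Toy stand-in for struct tipc_nlist (hypothetical, not the kernel type):
 * one flag per destination node instead of two queues. */
struct toy_nlist {
    size_t count;              /* number of destination nodes */
    bool sent[MAX_PEERS];      /* true = node is on the "sent" queue */
};

/* Hypothetical link state used only by this sketch. */
struct toy_links {
    bool congested[MAX_PEERS]; /* true = unicast link is congested */
    int delivered[MAX_PEERS];  /* copies each peer has received */
};

static void toy_nlist_init(struct toy_nlist *nl, size_t count)
{
    assert(count <= MAX_PEERS);
    nl->count = count;
    for (size_t i = 0; i < count; i++)
        nl->sent[i] = false;
}

/* One send attempt. Nodes already marked "sent" are skipped, so a later
 * attempt on the same list resumes where this one stopped. Returns 0 once
 * every node has the message, -EAGAIN if a congested link was hit. */
static int toy_rcast_xmit(struct toy_nlist *nl, struct toy_links *links)
{
    for (size_t i = 0; i < nl->count; i++) {
        if (nl->sent[i])
            continue;              /* delivered on an earlier attempt */
        if (links->congested[i])
            return -EAGAIN;        /* node stays "unsent" */
        links->delivered[i]++;
        nl->sent[i] = true;        /* move node from "unsent" to "sent" */
    }
    return 0;
}
```

With four peers and the link to the third one congested, the first call
delivers to the first two peers and returns -EAGAIN. Once the congestion
clears, a second call on the same list delivers only to the remaining two.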

Now, consider the following three socket variants:
a) blocking sockets with a fixed send timeout (say 100ms)
b) blocking sockets which set the MSG_DONTWAIT flag
c) non-blocking sockets

The struct tipc_nlist is stored on the stack of tipc_sendmcast(), implying
that it is stateless between subsequent calls for the same socket. This
leads to the following issue:

1. In tipc_sendmcast(), when we have sent the message to only some of the
   recipients and experience link congestion, we try to sleep. However, the
   socket send timeout value is so small that we might return from
   tipc_sendmcast() before we get the wakeup message. Thus we managed to
   send only to a subset of the recipients. The replicast state in
   tipc_nlist is destroyed as we exit tipc_sendmcast().
2. The application receives an EAGAIN and retries with the same message.
   This time tipc_sendmcast() is successful and the message is sent to all
   the replicast recipients.

Now some of the recipients receive the same message on the socket twice
due to 1 & 2.
[/partha]
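
The duplicate delivery in steps 1 and 2 can be reproduced with a small
user-space sketch in C. Nothing here is TIPC code: struct toy_net,
struct toy_sock and the toy_* functions are hypothetical, a bitmask stands
in for the unsent list, and -EAGAIN stands in for the congestion error. The
per-socket variant is only one conceivable direction for comparison, not
something proposed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NPEERS    4
#define ALL_PEERS ((UINT32_C(1) << NPEERS) - 1)

/* Hypothetical toy network state, not a TIPC structure. */
struct toy_net {
    uint32_t congested;     /* bit i set: link to peer i is congested */
    int delivered[NPEERS];  /* copies received by each peer */
};

/* Send to every peer whose bit is set in *pending and clear the bit on
 * success. Stops at the first congested link and returns -EAGAIN. */
static int toy_xmit(struct toy_net *net, uint32_t *pending)
{
    assert(net && pending);
    for (int i = 0; i < NPEERS; i++) {
        uint32_t bit = UINT32_C(1) << i;

        if (!(*pending & bit))
            continue;
        if (net->congested & bit)
            return -EAGAIN;
        net->delivered[i]++;
        *pending &= ~bit;
    }
    return 0;
}

/* Variant A: the pending set lives on the stack, as the thread describes
 * for the nlist in tipc_sendmcast(). Every call starts from all peers. */
static int toy_sendmcast_stateless(struct toy_net *net)
{
    uint32_t pending = ALL_PEERS;

    return toy_xmit(net, &pending);
}

/* Variant B (assumption): the pending set is kept per socket, so a retry
 * after -EAGAIN resumes instead of starting over. */
struct toy_sock {
    uint32_t pending;       /* 0 = no multicast in progress */
};

static int toy_sendmcast_stateful(struct toy_net *net, struct toy_sock *sk)
{
    if (!sk->pending)
        sk->pending = ALL_PEERS;
    return toy_xmit(net, &sk->pending);
}
```

With the link to peer 2 congested on the first call only, the stateless
variant leaves peers 0 and 1 with two copies each after the retry, while
the per-socket variant delivers exactly one copy to every peer. The sketch
ignores how a retry would be matched to the original message.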
