> -----Original Message-----
> From: Parthasarathy Bhuvaragan
> Sent: Friday, 11 November, 2016 06:56
> To: Jon Maloy <[email protected]>; [email protected];
> Ying Xue <[email protected]>
> Cc: [email protected]; [email protected]
> Subject: Re: [PATCH net-next v2 0/4] tipc: introduce multicast through 
> replication
> 
> On 11/10/2016 05:08 PM, Jon Maloy wrote:
> >
> >
> >> -----Original Message-----
> >> From: Parthasarathy Bhuvaragan
> >> Sent: Thursday, 10 November, 2016 10:50
> >> To: Jon Maloy <[email protected]>; tipc-[email protected];
> >> Ying Xue <[email protected]>
> >> Cc: [email protected]; [email protected]
> >> Subject: Re: [PATCH net-next v2 0/4] tipc: introduce multicast through replication
> >>
> >> On 10/27/2016 04:35 PM, Jon Maloy wrote:
> >>> TIPC multicast messages are currently distributed via L2 broadcast
> >>> or IP multicast to all nodes in the cluster, irrespective of the
> >>> number of real destinations of the message.
> >>>
> >>> In this series we introduce an option to transport messages via
> >>> replication ("replicast") across a selected number of unicast links,
> >>> instead of relying on the underlying media. This option is used when
> >>> true broadcast/multicast is not supported by the media, or when the
> >>> number of true destinations is much smaller than the cluster size.
> >>>
> >>> v2: - Fixed a counter bug when removing nodes from the destination node list
> >>>     - Moved definition of the node destination list to bcast.{h,c}
> >>>
> >>> Jon Maloy (4):
> >>>   tipc: add function for checking broadcast support in bearer
> >>>   tipc: add functionality to lookup multicast destination nodes
> >>>   tipc: introduce replicast as transport option for multicast
> >>>   tipc: make replicast a user selectable option
> >>>
> >>>  include/uapi/linux/tipc.h |   6 +-
> >>>  net/tipc/bcast.c          | 245 +++++++++++++++++++++++++++++++++++++++++-----
> >>>  net/tipc/bcast.h          |  40 +++++++-
> >>>  net/tipc/bearer.c         |  15 ++-
> >>>  net/tipc/bearer.h         |   6 ++
> >>>  net/tipc/link.c           |  12 ++-
> >>>  net/tipc/msg.c            |  17 ++++
> >>>  net/tipc/msg.h            |   2 +
> >>>  net/tipc/name_table.c     |  33 +++++++
> >>>  net/tipc/name_table.h     |   4 +
> >>>  net/tipc/node.c           |  27 +++--
> >>>  net/tipc/node.h           |   4 +-
> >>>  net/tipc/socket.c         |  89 ++++++++++-------
> >>>  net/tipc/udp_media.c      |   8 +-
> >>>  14 files changed, 424 insertions(+), 84 deletions(-)
> >>>
> >> [partha]
> >> I have a general concern that this design might not work for
> >> non-blocking sockets, or for blocking sockets which set the MSG_DONTWAIT
> >> flag.
> >>
> >> Consider that the user is replicasting to 4 peers.
> >> For example, in tipc_rcast_xmit() we manage to transmit to the first two
> >> peers successfully, but the next peer (3) fails due to link congestion.
> >> Since this is a non-blocking call, we return EAGAIN to the user.
> >> The subsequent retry from the user will re-deliver the same message to
> >> the first two peers.
> >>
> >> The checks for congestion are now based on the limits of the unicast
> >> links. We will easily get into the above situation, as the traffic
> >> patterns on the links are not the same.
> >>
> >> I think you will have a solution to this as always :-).
> >> [/partha]
> >
> > The solution is already there. This is why I have an "unsent" and a "sent"
> > queue in struct tipc_nlist. When a message has been successfully sent to a
> > node (or when an error code other than -ELINKCONG is returned), the
> > corresponding node item is moved from the "unsent" to the "sent" queue, and
> > will be disregarded at the next send attempt. But I now see that I have made
> > a stupid mistake during the last iteration of this code; I purge the
> > destination list before returning to the user, even when returning -EAGAIN.
> > I'll fix that.
> >
> > You may also wonder why I have the two queues in struct tipc_nlist, instead
> > of just deleting the items for sent nodes. This is because this list will be
> > reused across different sending sessions in later commits.
> >
> > ///jon
> >
> >
> [partha]
> Jon, the proposal works only for blocking sockets with the default send
> timeout (MAX_SCHEDULE_TIMEOUT). When the socket transmits to a peer, that
> peer is moved from unsent to sent, and the socket is put to sleep if the
> link is congested. Once the socket receives the wakeup message (the
> congestion ceases), it continues through the list of unsent peers until
> the message is sent to all of them.
> 
> Now, consider the following three socket variants:
> a) blocking sockets with a fixed send timeout (say 100 ms)
> b) blocking sockets which set the MSG_DONTWAIT flag
> c) non-blocking sockets
> 
> The struct tipc_nlist is stored on the stack of tipc_sendmcast(),
> implying that it is stateless between subsequent calls for the same
> socket. This leads to the following issue:

You are right. I realized that already when I sent my response yesterday.
One obvious solution would be to allocate the list on the heap, and keep a
reference to it between the sessions. If so, we might just as well keep the
message itself, i.e. the buffer chain, something we will probably have to do
anyway as per our previous discussion. But what if the sender doesn't send
the same message the next time? He may have accumulated more data in it, or
be sending a completely different message altogether.
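
To make the intended mechanism concrete, here is a minimal userspace model
of the idea -- NOT the actual TIPC code. All names (mc_state, mc_xmit,
link_congested) are invented for illustration. The point is that the
unsent/sent bookkeeping lives on the heap, referenced from the socket, so
that a retry after -EAGAIN skips the peers that already got the message:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define NPEERS 4

struct mc_state {                    /* would hang off the socket */
	int dests[NPEERS];           /* all destination peer ids */
	int ndests;
	int nsent;                   /* dests[0..nsent-1]: the "sent" queue */
};

/* Pretend the link to peer 3 is congested on the first attempt only */
static int link_congested(int peer, int attempt)
{
	return peer == 3 && attempt == 0;
}

static int mc_xmit(struct mc_state *st, int attempt)
{
	while (st->nsent < st->ndests) {
		int peer = st->dests[st->nsent];

		if (link_congested(peer, attempt))
			return -EAGAIN;      /* state survives the return */
		printf("attempt %d: delivered to peer %d\n", attempt, peer);
		st->nsent++;                 /* move peer unsent -> sent */
	}
	return 0;
}

int main(void)
{
	struct mc_state *st = calloc(1, sizeof(*st));

	for (int p = 1; p <= NPEERS; p++)
		st->dests[st->ndests++] = p;

	for (int attempt = 0; mc_xmit(st, attempt) == -EAGAIN; attempt++)
		printf("attempt %d: congestion, caller retries\n", attempt);

	free(st);                            /* purge only after full success */
	return 0;
}

Here peers 1 and 2 are not served again on the retry, which is exactly the
guarantee the stack-allocated tipc_nlist cannot give today.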

I see two safe solutions to this, both of them simple but less than
satisfactory:

1) If the socket is non-blocking, we always enforce usage of the full
   broadcast mechanism (but what if broadcast is not even supported?).
2) We make the socket blocking after all, i.e. we ignore the non-blocking
   setting if replicast is, or has to be, used.
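
For illustration, option 1) could look roughly like the sketch below. All
names are invented, and the 1:10 ratio is an arbitrary placeholder; I am
not claiming this is how the series would select the method:

#include <stdbool.h>
#include <stdio.h>

/* Invented names throughout; not the actual selection logic. */
enum mcast_method { USE_BCAST, USE_RCAST, USE_RCAST_FORCED_BLOCKING };

static enum mcast_method pick_method(bool sock_nonblocking,
				     bool bearer_bcast_supported,
				     int ndests, int cluster_size)
{
	if (sock_nonblocking) {
		if (bearer_bcast_supported)
			return USE_BCAST;            /* option 1) */
		return USE_RCAST_FORCED_BLOCKING;    /* option 2) fallback */
	}
	/* Blocking sender: replicast pays off for few destinations.
	 * The 1:10 threshold is an arbitrary placeholder. */
	return (ndests * 10 < cluster_size) ? USE_RCAST : USE_BCAST;
}

int main(void)
{
	printf("%d\n", pick_method(true, false, 3, 64)); /* forced blocking */
	printf("%d\n", pick_method(false, true, 3, 64)); /* replicast */
	return 0;
}

The unsatisfactory part is visible right in the sketch: a non-blocking
socket on a bearer without broadcast support still ends up being forced
into blocking behaviour.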

///jon

> 
> 1. In tipc_sendmcast(), when we send the message to some of the recipients
> and experience link congestion, we try to sleep. However, the socket send
> timeout value is so small that we might return from tipc_sendmcast()
> before we get the wakeup message. Thus we manage to send only to a subset
> of the recipients. The replicast state tipc_nlist is destroyed as we exit
> tipc_sendmcast().
> 
> 2. The application receives an EAGAIN and retries with the same message.
> This time tipc_sendmcast() is successful and the message is sent to all
> the replicast recipients.
> 
> Now some of the recipients receive the same message on the socket twice
> due to 1 & 2.
> 
> [/partha]

