Hi Allan,

At the last meeting I mentioned that we here at Ericsson had seen some
issues with TIPC when using the native interface, combined with running
dual links between nodes.
First discovery:
----------------
Access of a specific port via the native interface is non-re-entrant.
The reason for this is that the header for each message is "cached" in
the port structure, which is unprotected from parallel access. (With
the socket interface this is no problem, because the port is implicitly
protected by sock_lock.)

Scenario: CPU A is sending a connectionless message. The header is
constructed in the corresponding port structure before being copied
into the allocated sk_buff in the call to msg_build(). In parallel,
CPU B is also sending a message, to a different destination. Before A's
header has been copied into the send buffer, it is modified by B. The
result is that we may have an sk_buff in the send queue to one node,
but carrying the destination address of another node. Of course the
destination port, the message length, and possibly other header fields
may be wrong as well. We saw this happen several times.

I see three remedies for this:

a) We add an extra lock to protect each port. This gives the penalty
   of a redundant lock when accessing via the socket interface.

b) We build the header on the stack, at least for connectionless
   messages. For connection-oriented messages the header will not
   change, except for the message packet length. If we make sure to
   write the packet length directly into the sk_buff, just as we do
   with sequence numbers and other link-level fields, we can possibly
   keep the header "cache" for this type of message. (This needs to be
   analyzed further.)

c) We don't change anything. We just make it clear in the Programmer's
   Guide, and in comments in the header file, that these functions are
   *not* re-entrant and must be protected by a user-provided lock.

I would prefer solution b), if my assumption about connection-oriented
headers holds. A rough sketch of what I mean is pasted below my
signature.

Second discovery:
-----------------
When running parallel links between two nodes, a race condition occurs
between the discovery procedures of the two bearers.

Scenario: A discovery message from node A comes in on bearer 1. In
tipc_disc_recv_msg() we check whether there already is an allocated
node structure for node A, and if not, we create one. In parallel, a
discovery message from node A also comes in on bearer 2. The same check
is done after bearer 1 did its test, but before the node structure has
actually been allocated and added to the net structure. This is
possible because we only read-lock net_lock when a packet arrives in
tipc_recv_msg(). Unfortunately, we actually *do* change the net
structure here, so this is clearly wrong, and it results in a nasty
crash later on.

The work-around we used was to add a local spin lock around these lines
of the code (also sketched below), but this does not feel completely
satisfactory. Any suggestions here?

Regards
///jon
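
Sketch for b): the struct layout and the msg_build_from() helper are
invented here just to show the pattern, not taken from the actual code.
The only point is that each send fills a private header on its own
stack instead of mutating the shared per-port template:

    #include <linux/skbuff.h>
    #include <linux/types.h>

    /* Invented stand-in for the per-port header "cache". */
    struct port_hdr {
            u32 dest_node;
            u32 dest_port;
            u32 msg_size;
    };

    struct tipc_port {
            struct port_hdr hdr_template;  /* today: mutated by every send */
            /* ... */
    };

    /* Assumed helper: copies a finished header into a newly allocated
     * sk_buff, the way msg_build() copies the cached header today. */
    struct sk_buff *msg_build_from(struct port_hdr *hdr, u32 len);

    static struct sk_buff *port_send_connectionless(struct tipc_port *p,
                                                    u32 node, u32 ref,
                                                    u32 len)
    {
            struct port_hdr hdr = p->hdr_template;  /* private stack copy */

            /* These writes can no longer be interleaved with a parallel
             * send on the same port; each CPU works on its own copy. */
            hdr.dest_node = node;
            hdr.dest_port = ref;
            hdr.msg_size = len;

            return msg_build_from(&hdr, len);
    }

Connection-oriented messages could then keep the template, provided the
packet length is written straight into the sk_buff like the other
link-level fields.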

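And this is the shape of the work-around we used for the second
problem. node_find() and node_create() stand in for the real lookup and
allocation done in tipc_disc_recv_msg(); the substance is the dedicated
lock and the re-check under it:

    #include <linux/spinlock.h>
    #include <linux/types.h>

    static DEFINE_SPINLOCK(node_create_lock);

    /* Assumed stand-ins for the lookup/creation that
     * tipc_disc_recv_msg() performs; node_create() modifies the net
     * structure. */
    struct tipc_node *node_find(u32 addr);
    struct tipc_node *node_create(u32 addr);

    static struct tipc_node *disc_get_node(u32 addr)
    {
            struct tipc_node *n_ptr;

            /* net_lock is only read-held on the receive path, so the
             * find-or-create sequence must be serialized separately. */
            spin_lock_bh(&node_create_lock);
            n_ptr = node_find(addr);  /* re-check under the lock */
            if (!n_ptr)
                    n_ptr = node_create(addr);
            spin_unlock_bh(&node_create_lock);

            return n_ptr;
    }

The re-check under the lock is what closes the race: both bearers may
have seen "no node for A" before either of them took the lock.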