Hi Allan,
At the last meeting I mentioned that we here at Ericsson had seen some
issues with TIPC when using the native interface in combination with
dual links between nodes.

First discovery:
---------------
Access to a specific port via the native interface is not re-entrant.
The reason for this is that the header for each message is "cached" in
the port structure, which is unprotected against parallel access. (With
the socket interface this is no problem, because the port is implicitly
protected by sock_lock.)
Scenario: CPU A is sending a connectionless message. The header is
constructed in the corresponding port structure before being copied
into the allocated sk_buff in the call to msg_build(). In parallel,
CPU B is also sending a message, to a different destination. Before A's
header has been copied into the send buffer, it is modified by B. The
result is that we may have an sk_buff in the send queue to one node,
but with the destination address of another node. Of course the
destination port, message length, and possibly other header fields may
be wrong as well. We saw this happen several times.
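
To make the race concrete, here is a minimal sketch of the pattern,
with simplified types and placeholder accessor names (illustration
only, not the actual TIPC source):

struct port {
        struct tipc_msg phdr;   /* per-port header "cache", unprotected */
        /* ... */
};

int port_send(struct port *p, u32 destnode, u32 destport,
              void *data, u32 len)
{
        /* CPU A fills in the shared, per-port header ... */
        msg_set_destnode(&p->phdr, destnode);
        msg_set_destport(&p->phdr, destport);
        msg_set_size(&p->phdr, len);
        /*
         * ... but CPU B, sending on the same port, may overwrite
         * p->phdr before msg_build() has copied it into the sk_buff.
         */
        return msg_build(&p->phdr, data, len);
}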

I see three remedies for this:
a) We add an extra lock to protect each port. This would give the
penalty of a redundant lock when accessing via the socket interface.
b) We build the header on the stack, at least for connectionless
messages. For connection-oriented messages the header will not change,
except for the message packet length. If we make sure to write the
packet length directly into the sk_buff, just as we do with sequence
numbers and other link-layer fields, we can possibly keep the header
"cache" for this type of message. (This needs to be analyzed further.)
c) We don't change anything. We just make it clear in the Programmer's
Guide, and in comments in the header file, that these functions are
*not* re-entrant and must be protected by a user-provided lock.

I would prefer solution b), if my assumption about connection-oriented 
headers holds.
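
To illustrate b) for the connectionless path, a rough sketch (same
placeholder names as above; just the idea, not tested code):

int port_send(struct port *p, u32 destnode, u32 destport,
              void *data, u32 len)
{
        struct tipc_msg hdr = p->phdr;  /* private copy on the stack */

        /* All modifications now hit the local copy only, so parallel
         * senders on the same port can no longer corrupt each other. */
        msg_set_destnode(&hdr, destnode);
        msg_set_destport(&hdr, destport);
        msg_set_size(&hdr, len);

        return msg_build(&hdr, data, len);  /* copy into the sk_buff */
}

For connection-oriented messages the cached header could then stay as
it is, as long as the packet length is written directly into the
sk_buff.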

Second discovery:
----------------
When running parallel links between two nodes, a race condition occurs
between the discovery procedures of the two bearers.
Scenario: A discovery message from node A comes in on bearer 1. In
tipc_disc_recv_msg(), we check whether there is already an allocated
structure for node A, and if not, we create one.
In parallel, a discovery message from node A also comes in on bearer 2.
The same check is done after bearer 1 did the test, but before the node
structure is actually allocated and added to the net structure.
This is possible because we only read-lock "net_lock" when a packet
arrives in tipc_recv_msg(). Unfortunately, we actually *do* change the
net structure here, so this is clearly wrong, and it results in a nasty
crash later on.
The work-around we used was to add a local spin_lock around these lines
of the code, but this does not feel completely satisfactory. Any
suggestions here?
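
For reference, the racy sequence and our spin_lock work-around look
roughly like this (simplified; tipc_node_find()/tipc_node_create() are
placeholders for the actual lookup and allocation calls):

/* In tipc_disc_recv_msg(), under read_lock(&net_lock) only: */
n_ptr = tipc_node_find(addr);
if (!n_ptr)
        n_ptr = tipc_node_create(addr); /* both bearers can get here */

/* Work-around: serialize the check-and-create step locally */
static DEFINE_SPINLOCK(node_create_lock);

spin_lock_bh(&node_create_lock);
n_ptr = tipc_node_find(addr);
if (!n_ptr)
        n_ptr = tipc_node_create(addr);
spin_unlock_bh(&node_create_lock);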


Regards
///jon
