Reading through the ZRE spec at http://rfc.zeromq.org/spec:20, I see a 
fundamental problem of state agreement in the node discovery process as 
outlined there. The problem is with this sentence: "When a ZRE node receives a 
beacon from a node that it does not already know about, it SHALL consider this 
to be a new peer".

First, it is unclear what the 'joined' status actually means, since that isn't 
defined. If I assume 'joined' means '2-way communication is possible with this 
node', then the method outlined fails to establish that property consistently 
across nodes.

This is why:

- Assume you have a network which happens to support only one-way traffic, 
maybe intermittently. Note this is a _very_ common failure mode, at least 
temporarily, as routing changes happen in IP networks, so it had better be a 
case the protocol is prepared for.
- Assume 2 nodes, A and B. 
- Assume A can send packets to B, but packets from B fail to reach A (at least 
temporarily).
- A 'joins' the net
- B receives A's beacon
- A does NOT receive B's beacon
- A and B disagree about the 'joined' status.
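
To make the scenario concrete, here is a minimal sketch in Python. It models 
only the rule quoted above, nothing else from the spec. The names BeaconNode, 
deliver and link_up are made up for illustration and are not taken from ZRE or 
any implementation.

```python
# Hypothetical model of beacon-only discovery; this is not ZRE code.
class BeaconNode:
    def __init__(self, name):
        self.name = name
        self.peers = set()  # nodes this node considers 'joined'

    def on_beacon(self, sender):
        # The quoted rule: a beacon from an unknown node makes it a new peer.
        self.peers.add(sender)


def deliver(sender, receiver, link_up):
    """Deliver one beacon if the directed link sender -> receiver works."""
    if link_up.get((sender.name, receiver.name), False):
        receiver.on_beacon(sender.name)


a, b = BeaconNode("A"), BeaconNode("B")
link_up = {("A", "B"): True, ("B", "A"): False}  # one-way network

deliver(a, b, link_up)  # B receives A's beacon
deliver(b, a, link_up)  # A does NOT receive B's beacon

print(b.peers)  # {'A'}
print(a.peers)  # set()
```

B lists A as a peer while A knows nothing about B, and neither node has any 
way to notice the disagreement.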

ZRE is not the first protocol to have to deal with such issues. See for 
instance the OSPF Hello FSM, or the very similar session FSM in the Server 
Cache Synchronisation protocol (RFC2334). In particular, the OSPF spec is well 
written and a suggested read on the topic.

The way OSPF deals with the issue, in principle, is to distinguish between 
1-way and 2-way adjacency like so:

- a joining node A sends its own node ID, PLUS a list of all node IDs it has 
already heard (which may be empty at startup)
- a receiving node B getting such a 'HELLO' packet checks its own ID against 
the list of 'already heard' IDs in the received packet.
-- if its own node ID is not contained, it transitions the state for A to 
'1-way'.
-- if its own node ID _is_ contained, it transitions the state for A to '2-way' 
(OSPF's '2-Way' neighbor state)
-- it replies with its own node ID, and includes A's node ID.
- on reception, A sees its node ID in the reply, and transitions the state for 
B to '2-way'.
- due to keepalive timers and periodic HELLO exchange, nodes eventually 
converge on a consistent view of state
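
The steps above can be sketched roughly as follows. This is my own simplified 
model in Python, not OSPF or ZRE code: Hello, HelloNode and exchange are 
hypothetical names, timers are left out, and one call to exchange stands for 
one round of periodic HELLOs in each direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hello:
    sender: str
    heard: frozenset  # IDs of all nodes the sender has already heard


class HelloNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = {}  # peer ID -> '1-way' or '2-way'

    def make_hello(self):
        # Own ID plus the list of already heard IDs (empty at startup).
        return Hello(self.node_id, frozenset(self.state))

    def on_hello(self, hello):
        # 2-way only if the sender has demonstrably heard us.
        if self.node_id in hello.heard:
            self.state[hello.sender] = "2-way"
        else:
            self.state[hello.sender] = "1-way"


def exchange(x, y, link_up):
    """One round of HELLOs over links that may work in one direction only."""
    for src, dst in ((x, y), (y, x)):
        if link_up[(src.node_id, dst.node_id)]:
            dst.on_hello(src.make_hello())


a, b = HelloNode("A"), HelloNode("B")

# Packets from B never reach A: B keeps A at '1-way', A knows nothing.
one_way = {("A", "B"): True, ("B", "A"): False}
for _ in range(3):
    exchange(a, b, one_way)
state_during_failure = (dict(a.state), dict(b.state))

# Once the return path works, two rounds are enough to converge.
both_ways = {("A", "B"): True, ("B", "A"): True}
for _ in range(2):
    exchange(a, b, both_ways)
```

The point of the design is that B never claims more than '1-way' for A until A 
has echoed B's ID back, so neither side reports a 2-way relationship that the 
network cannot actually carry.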

OSPF Hello packets also contain other important attributes, like keepalive 
and 'router dead' timer values, which are worth considering as part of the 
beacon exchange.
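
As a rough illustration of what a 'router dead' timer buys you, here is a 
small Python sketch. It assumes each HELLO advertises a dead interval and that 
the receiver re-arms its timer with that advertised value; that is my 
assumption for the sketch, not something either spec mandates in this form. 
PeerTable and its methods are hypothetical names, and the clock is passed in 
explicitly to keep the example deterministic.

```python
class PeerTable:
    def __init__(self):
        self.deadline = {}  # peer ID -> time at which the peer is presumed dead

    def on_hello(self, peer_id, dead_interval, now):
        # Every HELLO re-arms the timer (assumption: advertised value is used).
        self.deadline[peer_id] = now + dead_interval

    def expire(self, now):
        """Remove and return the peers whose dead interval has elapsed."""
        dead = sorted(p for p, t in self.deadline.items() if now >= t)
        for p in dead:
            del self.deadline[p]
        return dead


table = PeerTable()
table.on_hello("A", dead_interval=4.0, now=0.0)
early = table.expire(now=3.0)    # []: A is still within its interval
table.on_hello("A", dead_interval=4.0, now=3.0)
still = table.expire(now=6.0)    # []: the second HELLO moved the deadline to 7.0
late = table.expire(now=7.0)     # ['A']: no HELLO arrived in time
```

With something like this in place, a peer that silently disappears, or whose 
packets stop arriving, is dropped after a bounded time instead of staying 
'joined' forever.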

----

I think, and hope, that ZRE will turn out to be extremely useful. Let's just 
stop reinventing the wheel on the basics.

- Michael

