Sounds pretty sensible. You might want to consider having repairers separate from the publishers, particularly if you have a bursty source of messages. Then if a subscriber can't keep up, it can go to the repairer without affecting the publisher.
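As a rough sketch of the subscriber side of that idea (plain Python, no ZeroMQ calls): the subscriber tracks the last sequence number per publisher and, on a gap, asks a repairer for the missing range instead of going back to the publisher. The names `GapTrackingSubscriber` and `fetch_range` are made up for illustration; `fetch_range` stands in for whatever request-reply call reaches the repairer. The fetch is synchronous here only to keep the example short.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical message shape: (sequence_number, payload).
Message = Tuple[int, bytes]
# Hypothetical repair call: return messages first..last (inclusive) for a publisher.
FetchRange = Callable[[str, int, int], List[Message]]


class GapTrackingSubscriber:
    """Tracks the last sequence number seen per publisher and repairs gaps."""

    def __init__(self, fetch_range: FetchRange) -> None:
        self._fetch_range = fetch_range
        self._last_seen: Dict[str, int] = {}
        self.delivered: List[Tuple[str, int, bytes]] = []

    def on_message(self, publisher_id: str, seq: int, payload: bytes) -> None:
        last = self._last_seen.get(publisher_id)
        if last is None:
            # First message from this publisher: record where we joined.
            self._deliver(publisher_id, seq, payload)
            return
        if seq <= last:
            return  # duplicate, or already filled in by an earlier repair
        if seq > last + 1:
            # Gap detected: fetch the missing range from the repairer.
            missing = self._fetch_range(publisher_id, last + 1, seq - 1)
            if [s for s, _ in missing] != list(range(last + 1, seq)):
                # The repairer no longer holds the full range.
                raise RuntimeError(
                    f"unrecoverable gap {last + 1}..{seq - 1} from {publisher_id}"
                )
            for s, p in missing:
                self._deliver(publisher_id, s, p)
        self._deliver(publisher_id, seq, payload)

    def _deliver(self, publisher_id: str, seq: int, payload: bytes) -> None:
        self.delivered.append((publisher_id, seq, payload))
        self._last_seen[publisher_id] = seq
```

Because the state is kept per publisher, a real implementation could run the repair for one publisher in the background and keep processing the others.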
Being smart about the batching can also make the system behave more smoothly in failure modes. If a subscriber is failing to keep up and dropping occasional messages, it may be best for it to disconnect until its backlog is processed, pull a large batch in recovery mode, then reconnect to the stream.

Ian

On Wed, May 8, 2013 at 12:18 AM, Doron Somech <[email protected]> wrote:
> Hi All,
>
> Usually we are using zeromq with pgm as our message bus. We are using
> message bus to publish events between server side services.
>
> The issue is that we need to support environment where multicast is not
> supported (like amazon cloud).
>
> I'm working on a design to make tcp based message bus and want to get your
> thoughts on that.
>
> There are three major requirements, we want services to be able to come
> and go without need to reconfigure the system, we want a brokeless design
> and we want to be able to recover lost messages between a publisher and a
> subscriber (caused by connection problem) like pgm does.
>
> We have three types of components, a discovery service, publisher and
> subscriber.
>
> Discovery Service is a standalone service, the discovery service has the
> list of all the subscribers in the network, the subscriber ping the
> discovery service every X seconds, when specific subscriber didn't ping the
> service for more than Y seconds it consider dead. On every new subscriber
> the publisher publish a message to all the publishers. For high
> availability there are more than one discovery services (probably 3).
>
> When publisher is starting it's asking the discovery service for all of
> the subscribers and subscribe for new subscribers (it asked all configured
> discovery services and takes the first answer, it subscribed for all of the
> discovery services). After getting the list the publisher is connecting to
> all of the subscribers. The publisher also connects to every new
> subscriber. The publisher is ignoring dead subscribers (mostly because I
> don't know how to handle it because the dead message can come from one of
> the discovery service but can still be alive on others).
>
> All the messages the publisher is sending are numbered, also the publisher
> is saving the X last messages it sends to support recovery of lost
> messages. Each publisher has a unique random id.
>
> If publisher doesn't send a message in X seconds the publisher will send a
> keep alive message to all subscribers.
>
> As mentioned the subscriber ping the discovery services every X seconds,
> when the subscriber get a message from a publisher for the first time it's
> saving the message number. From there if the subscriber detects a gap in
> the messages it directly connects to the publisher (using request-response)
> and asking for the missing messages. The only problem is that in lost
> messages situation the subscriber will stop handle new messages from all
> publishers until the missing messages are restored.
>
> If the publisher doesn't have those messages anymore the subscriber should
> raise an exception or restart the entire service.
>
> The only thing the subscriber and publisher need to know is the addresses
> of the discovery services.
>
> The reason I want the publisher to connect to the subscriber is to make
> sure when the connection is dropped the publisher will be able to recognize
> it and reconnect (the subscriber may not be able to recognize it because it
> doesn't send any data to the publishers).
> Thanks, I will very much appreciate your comments.
>
> Doron
>
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>
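For the numbered messages and the "X last messages" history in the design quoted above, a bounded buffer is probably enough. The following is only a sketch under my own assumptions: `NumberedHistory`, `HistoryUnavailable`, `publish` and `replay` are hypothetical names, sequence numbers start at 1, and the actual socket send is left out. The same structure could live in the publisher or in a separate repairer.

```python
import uuid
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical message shape: (sequence_number, payload).
Message = Tuple[int, bytes]


class HistoryUnavailable(Exception):
    """Raised when a requested range has already fallen out of the buffer."""


class NumberedHistory:
    """Numbers outgoing messages and keeps only the last `capacity` of them."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.publisher_id = uuid.uuid4().hex  # unique random id per publisher
        self._next_seq = 1  # assumption: numbering starts at 1
        self._history: Deque[Message] = deque(maxlen=capacity)

    def publish(self, payload: bytes) -> Message:
        """Assign the next sequence number and remember the message."""
        msg = (self._next_seq, payload)
        self._history.append(msg)  # the oldest entry is dropped automatically
        self._next_seq += 1
        return msg  # the caller would send this on the wire

    def replay(self, first: int, last: int) -> List[Message]:
        """Return messages first..last (inclusive) for a repair request."""
        if first > last:
            raise ValueError("first must not exceed last")
        if (
            not self._history
            or first < self._history[0][0]
            or last > self._history[-1][0]
        ):
            raise HistoryUnavailable(f"range {first}..{last} is not buffered")
        # Sequence numbers in the buffer are contiguous, so index by offset.
        offset = first - self._history[0][0]
        return [self._history[i] for i in range(offset, offset + last - first + 1)]
```

When `replay` raises, the subscriber is in the "messages no longer available" case from the design and has to raise or restart.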
