You may have missed this thread: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201509.mbox/%3CCAF+A=7rhamcgxewhpr0n+z5-rwvxslxd1ghtth7cp3ozmvg...@mail.gmail.com%3E
What you describe is implemented, perhaps different in some details, here: http://nirmataoss.github.io/workflow/

On Wed, Sep 9, 2015 at 2:29 PM, Simon <[email protected]> wrote:
> Hi
>
> I have been reading a lot about ZooKeeper lately because we have the
> requirement to distribute our workflow engine (for performance and
> reliability reasons). Currently, we have a single instance. If it fails,
> no work is done anymore. One approach mentioned on this mailing list and
> in online documents is a master-worker setup. The master is made reliable
> by running multiple instances and electing a leader. This setup might work
> unless there are too many events to be dispatched by a single master. I am
> a bit concerned about that, as currently the work dispatcher does no I/O.
> It is blazing fast. If the work dispatcher (scheduler) now has to
> communicate over the network, it might get too slow. So I thought about a
> different solution and would like to hear what you think about it.
>
> Let’s say each workflow engine instance (referred to as a node from now
> on) registers an ephemeral znode under /nodes. Each node installs a
> children watch on /nodes and uses this information to populate a hash
> ring. Each node is only responsible for the workflow instances that map to
> its part of the hash ring. Event notifications would then be dispatched to
> the correct node based on the same hash ring. The persistence of workflow
> instances and events would still be done in a highly available database;
> only notifications about events would be dispatched. Whenever an event
> notification is received, a workflow instance has to be dispatched to a
> given worker thread within a node.
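A minimal sketch of that registration-and-watch piece, assuming the plain
ZooKeeper Java client and a pre-created /nodes parent; the class name, the
100 virtual points per node, and the hash function are illustrative
placeholders, not a definitive implementation:

    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import org.apache.zookeeper.*;

    // Each engine registers itself as an ephemeral znode and keeps a
    // consistent-hash ring of the live nodes. For brevity this skips the
    // initial connection handshake (production code would wait for the
    // SyncConnected event before calling create()).
    public class NodeRegistry implements Watcher {
        private final ZooKeeper zk;
        private final String nodeId;
        // ring position -> owning node id
        private final SortedMap<Long, String> ring = new TreeMap<>();

        public NodeRegistry(String connectString, String nodeId) throws Exception {
            this.nodeId = nodeId;
            this.zk = new ZooKeeper(connectString, 15000, this);
            // Ephemeral: vanishes automatically when this node's session expires.
            zk.create("/nodes/" + nodeId, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            rebuildRing();
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeChildrenChanged) {
                try {
                    rebuildRing();
                } catch (Exception e) {
                    // retry / resync; watches are one-shot, so this must succeed
                }
            }
        }

        private synchronized void rebuildRing() throws Exception {
            // Passing "this" re-arms the children watch for the next change.
            List<String> nodes = zk.getChildren("/nodes", this);
            ring.clear();
            for (String n : nodes) {
                for (int v = 0; v < 100; v++) {       // 100 virtual points per node
                    ring.put(hash(n + "#" + v), n);
                }
            }
        }

        // Owner of a workflow instance = first ring point at or after its
        // hash, wrapping around to the start of the ring if necessary.
        public synchronized String ownerOf(String workflowId) {
            SortedMap<Long, String> tail = ring.tailMap(hash(workflowId));
            return tail.isEmpty() ? ring.get(ring.firstKey())
                                  : tail.get(tail.firstKey());
        }

        public synchronized boolean isMine(String workflowId) {
            return nodeId.equals(ownerOf(workflowId));
        }

        private static long hash(String s) {
            // Placeholder; a real setup would use a well-distributed,
            // stable hash such as Murmur3.
            return s.hashCode() & 0xffffffffL;
        }
    }

Note that ZooKeeper watches fire only once, so rebuildRing() has to
re-register the watch on every change. Also, nodes can briefly disagree
about the ring while a membership change propagates, which is exactly the
overlap window the per-instance locking below has to cover.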
> The interesting cases happen if a new node is added to the cluster or if
> a workflow engine instance fails. Let’s talk about failures first. As
> processing a workflow instance takes some time, we cannot simply switch
> over to a new instance right away. After all, the previous node might
> still be running (i.e. it did not crash, it just failed to heartbeat). We
> would have to wait some time (how long?!) until we know that the failing
> node has dropped the work it was doing (it gets notified of the session
> expiration). The failing node cannot communicate with ZooKeeper, so it has
> no way of telling us that it is done. We also cannot use the central
> database to find out what happened: if a node fails (e.g. due to a
> hardware or network failure), it cannot update the database, so the
> database will look the same regardless of whether the node is working or
> has crashed. It must be impossible for two workflow engines to work on the
> same workflow instance, as that would result in duplicate messages being
> sent out to backend systems (not a good idea).
>
> The same issue arises with adding a node. The new node has to wait until
> the other nodes have stopped working. OK, I could use ZooKeeper in this
> case to lock workflow instances: a node may only start working on a
> workflow instance if the lock is available. Whenever a node is added, it
> will look up pending workflow instances in the database (those that have
> open events).
>
> To summarize: instead of having a central master that coordinates all
> work, I would like to divide the work “space” into segments by using
> consistent hashing. Each node is an equal peer and can operate freely
> within its assigned work “space”.
>
> What do you think about that setup? Is it something completely stupid?!
> If yes, let me know why. Has somebody done something similar successfully?
> What would I have to watch out for? What would I need to add to the basic
> setup mentioned above?
>
> Regards,
> Simon
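For the per-instance locking Simon mentions, the simplest ZooKeeper recipe
is one ephemeral znode per workflow instance: it disappears on its own when
the holder's session expires, so the session timeout is also the answer to
the "how long do we wait?" question. A minimal sketch under the same
assumptions (plain Java client, pre-created /locks parent, illustrative
paths):

    import org.apache.zookeeper.*;

    public final class InstanceLock {
        // Try to claim exclusive ownership of one workflow instance. The
        // lock znode is ephemeral, so it is released automatically when this
        // node's session expires -- no cleanup is needed after a crash.
        public static boolean tryLock(ZooKeeper zk, String workflowId)
                throws InterruptedException {
            try {
                zk.create("/locks/" + workflowId, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;                   // ours until our session ends
            } catch (KeeperException.NodeExistsException e) {
                return false;                  // another engine still holds it
            } catch (KeeperException e) {
                throw new RuntimeException(e); // connection loss etc.: back off, retry
            }
        }
    }

One caveat to watch out for: a partitioned node only learns that its
session has expired after it reconnects, so to preserve the "never two
engines on one instance" guarantee each node should stop processing as soon
as it sees a Disconnected event, rather than waiting for Expired.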
