Nice article by the Pinterest folks on ZooKeeper as a single point of failure (SPoF): http://engineering.pinterest.com/post/77933733851/zookeeper-resilience-at-pinterest
Though I agree with the problems, I'm not sure I would go to the extent of running separate daemons, since that introduces more points of failure. With Helix, however, we designed the system to keep working in its current state if ZooKeeper crashes. At least, that was my goal during the initial coding phase. Basically, the system should keep working as if nothing happened. The only compromise is that no more transitions can happen in the system while ZooKeeper is down.

Should we add an integration test to always guarantee this property? Is this valuable?

Thanks,
Kishore G
