I brought this up on IRC, but mahadev pointed me here so more people could benefit from the discussion.
I was primarily asking about the usage at Yahoo, but if you're reading along, have information about a large zookeeper deployment, and would like to share, please do so :)

I'm interested in the patterns used for zookeeper at large scale. For a point of reference, I consider "large" to be anything above a coupla thousand clients. Within that scope, here's a list of stuff I'm curious about:

1) Is zookeeper deployed in an "ensemble per application" model, or are there deployments that use a "one general purpose ensemble per cluster" model, or maybe something else?

2) When the number of clients gets very large, are there tricks beyond simply scaling out observers with read load?

3) How is zk server discovery handled in a large environment? Hardcoded IPs? DNS aliases? Something else?

4) This was brought up on the list recently: is there a strategy for managing the ensemble member replacement problem? It's pretty undesirable to restart clients just so they learn about a replaced machine.

Those are all of my specific questions ... but if there's other info that anyone feels is pertinent to this topic, please don't be shy :)

.laz