> ZK seemed pretty darned stable through all of this.
Sounds like a nice test, and it's great to hear that ZooKeeper works well there.
> The only instability that I saw was caused by excessive amounts of data in
> ZK itself. As I neared the (small) amount of memory I had allocated for ZK
> use, I would see servers go into paroxysms of GC, but the cluster
> functionality was impaired to a very surprisingly small degree.
Cool, makes sense.
> No. I considered it, but I wanted fewer moving parts rather than more.
> Doing that would make the intricate and unlikely failure mode that Henry
> asked about even less likely, but I don't know if it would increase or
> decrease the probability of any kind of failure.
Yeah, I guess it depends a bit on the system architecture too. If the
system is designed in such a way that ZK keeps track of coordination
data which must be restored after a full stop of the system, keeping
it on persistent storage would prevent the loss of important
information. If ZK is really just coordinating ephemeral data
(e.g. locks), then if the whole system goes down, it's ok to just
let it start up again in an empty state.
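To make that distinction concrete, here is a rough toy model in Python. It
is only a sketch and does not use the ZooKeeper API: the class
ToyCoordinationStore, the durable_disk flag, and the paths are all made up
for illustration. It just shows what would be left after a full stop under
each setup.

```python
class ToyCoordinationStore:
    """Toy stand-in for a coordination service; NOT the ZooKeeper API."""

    def __init__(self, durable_disk):
        # durable_disk is a hypothetical flag: True means the data directory
        # survives a full stop, False means it lives on throwaway storage.
        self.durable_disk = durable_disk
        self.nodes = {}  # path -> (value, is_ephemeral)

    def create(self, path, value, ephemeral=False):
        self.nodes[path] = (value, ephemeral)

    def full_stop_and_restart(self):
        """Simulate every server and every client going down at once."""
        if not self.durable_disk:
            # Nothing was kept on disk, so the service comes back empty.
            self.nodes = {}
            return
        # Session-bound (ephemeral) entries such as locks disappear because
        # their owners are gone; everything else is recovered from disk.
        self.nodes = {
            path: (value, eph)
            for path, (value, eph) in self.nodes.items()
            if not eph
        }

    def paths(self):
        return sorted(self.nodes)


def surviving_paths(durable_disk):
    store = ToyCoordinationStore(durable_disk)
    store.create("/jobs/assignments", "worker-1:shard-7")       # must be resumed
    store.create("/locks/shard-7", "worker-1", ephemeral=True)  # just a lock
    store.full_stop_and_restart()
    return store.paths()


if __name__ == "__main__":
    print("durable disk:    ", surviving_paths(True))
    print("non-durable disk:", surviving_paths(False))
```

In the lock-only case both setups end up equivalent from the application's
point of view, which is why skipping persistent storage can be a
reasonable trade for fewer moving parts.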
> The observed failure modes for ZK in EC2 were completely dominated by our
> (my) own failings (such as letting too much data accumulate).
Details always take a few iterations to get really right.
Thanks for this data, Ted.