Hi all, Thanks for the help offered by people on the list (esp Ted Dunning, Mitchi Mutsuzaki and Vitalli Tymchyshyn for initial pointers). We were able to make some progress on Partitioned ZooKeeper ( http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper). Our project report is available here: http://www.slideshare.net/pramodbiligiri/zookeeper-partitioningprojectreport
We modified the ZooKeeper code to create a rudimentary proof-of-concept implementation to test only the performance critical parts (the sharding logic is hardcoded, leader failure/election is not handled etc) Our experiments showed a 14% increase in throughput. Unfortunately we didn't have time to drill down into the performance improvement in more detail. For example, we couldn't successfully turn off disk-write caching on EBS volumes, run with 3 quorums on each node, or try our changes on a larger cluster. So I would take that value with a grain of salt. Comments and criticism are welcome. If people see some value in this work, I'll try to run some more experiments. The code is here: https://bitbucket.org/pramodbiligiri/zk-multileader The code changes were not much: We added a QuorumPeerThread class to QuorumPeerMain that launches multiple QuorumPeers on each node, and a client request is routed to the appropriate QuorumPeer by examining its path. The existence of multiple quorums is abstracted in most other places by a class QuorumPeerMulti that wraps multiple QuorumPeer-s and acts as a facade. Note that each quorum on each node writes to its own transaction log device and data directory. Pramod -- http://twitter.com/pramodbiligiri
