Quick thought. I saw Stefan's Katta presentation last night. Katta seems nice and simple. If I understood correctly, juicy stuff that is interesting to Solr is:

- Katta has a notion of a Primary Master and N Secondary Slaves (no SPOF there)
- Search Nodes serve index shards copied locally from some shared storage
- Zookeeper instances (again Primary Master and N Secondary Slaves) that facilitate communication among distributed components

The master:
-- knows how to distribute a set of index shards it is given across a number of search nodes (distribution policy pluggable, similar to Hadoop's, but different)
-- has a map of which shard is on which search node (in Zookeeper)
-- knows how to replicate each shard (replication factor configurable)
-- knows when a search node goes down (via Zookeeper notification)
-- knows how to create more replicas of the shards that were on a dead search node (and remove the extra replicas when the search node is revived)
-- can notify search nodes when a new index is available (via Zookeeper)

More in: http://joa23.files.wordpress.com/2008/09/katta-overview.pdf

Paul Noble will like slide #13 ;)

In particular, I think that:

- Making use of Zookeeper for index snapshot + replication might be useful (the Master publishes the info about a new snapshot to Zookeeper, and Search Slaves get notified immediately and start copying the index)
- Making use of Zookeeper for keeping a map of index shards + applying a replication factor would be very useful
- Making use of a pluggable shard placement policy would be useful

Thoughts?

Also: While Katta provides shard->search server functionality via a pluggable impl, what both Solr and Katta are still missing is the doc->shard functionality. However, this might not be terribly hard if we do something similar to Katta's pluggable shard->search server distribution policy.

Please mind I'm saying this without having looked at any of the Katta code.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
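
P.S. To make the two mappings a bit more concrete, here is a rough Python sketch of what a pluggable doc->shard policy and a pluggable shard->search node placement policy (with a replication factor) could look like. This is not Katta or Solr code, and I have not checked it against either code base. Every name in it (hash_doc_to_shard, round_robin_placement, reassign_after_failure) is made up for illustration, and the hashing and round-robin choices are just assumptions; a real policy would presumably keep the resulting map in Zookeeper rather than in a local dict.

```python
import hashlib
from typing import Dict, List


def hash_doc_to_shard(doc_id: str, num_shards: int) -> int:
    """Hypothetical doc->shard policy: stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def round_robin_placement(shards: List[str], nodes: List[str],
                          replication_factor: int) -> Dict[str, List[str]]:
    """Hypothetical shard->node policy: each shard goes to
    `replication_factor` distinct nodes, assigned round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, shard in enumerate(shards):
        placement[shard] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication_factor)]
    return placement


def reassign_after_failure(placement: Dict[str, List[str]], dead_node: str,
                           live_nodes: List[str]) -> Dict[str, List[str]]:
    """Give every shard that lost a replica on `dead_node` a new replica
    on the least-loaded live node that does not already hold it."""
    live = [n for n in live_nodes if n != dead_node]
    # Count how many replicas each live node currently serves.
    load = {n: 0 for n in live}
    for replicas in placement.values():
        for n in replicas:
            if n in load:
                load[n] += 1
    result = {}
    for shard, replicas in placement.items():
        kept = [n for n in replicas if n != dead_node]
        if len(kept) < len(replicas):
            candidates = [n for n in live if n not in kept]
            if candidates:
                target = min(candidates, key=lambda n: load[n])
                kept.append(target)
                load[target] += 1
        result[shard] = kept
    return result


if __name__ == "__main__":
    nodes = ["node0", "node1", "node2"]
    shards = ["shard0", "shard1", "shard2"]
    placement = round_robin_placement(shards, nodes, replication_factor=2)
    print(placement)
    print(reassign_after_failure(placement, "node1", ["node0", "node2"]))
    print(shards[hash_doc_to_shard("doc-42", len(shards))])
```

The point is only that both mappings are small functions behind a stable interface, so swapping in a different policy (range-based doc->shard routing, rack-aware placement, etc.) would not have to touch the rest of the system.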
