Hi,
I would like to describe a process we use to recover the cluster state after networking issues. I would appreciate it if anyone could point out the flaws in this approach and say what the best practice is for recovering from network problems involving ZooKeeper.

I'm working with SolrCloud 5.2.1, with ~100 collections in a cluster of 6 machines.

This is the short procedure:
1. Bring the whole cluster down.
2. Clear all data from ZooKeeper.
3. Upload the configuration.
4. Restart the cluster.

We rely on the fact that the core discovery process creates a collection if it does not already exist. This gives us a lot of flexibility. When the cluster comes up, it reads core.properties and creates the collections as needed. Since we have only one configuration, the collections are automatically linked to it, and the cores inherit it from their collection.

This is a very robust procedure that helped us overcome many problems until we stabilized our cluster, which is now pretty stable. I know that the leader may change in this case and that we may lose updates, but that is acceptable for us.

The problem is that I now want to add a new config set. When I add it and clear ZooKeeper, the cores cannot be created because there are 2 configurations, so the automatic linking no longer works. This breaks my recovery procedure.

I thought about a few options:
1. Put the config name in core.properties. This doesn't work. (It is supported in CoreAdminHandler, but is discouraged according to the documentation.)
2. Change the recovery procedure to delete only the relevant parts of the data in ZooKeeper, not all of it.
3. Change the recovery procedure to delete everything, but recreate and link the configurations for all collections before startup.

Option #1 is my favorite because it is very simple. It is currently not supported, but from looking at the code it does not seem complex to implement.

My questions are:
1. Is there something wrong with the recovery procedure that I described?
2. What is the best way to fix problems in the cluster state, other than editing clusterstate.json manually? Is there an automated tool for that? We have about 100 collections in the cluster, so manual editing is not really a solution.
3. Is creating a collection via core.properties also discouraged?

I would very much appreciate any answers or thoughts on this.

Thanks,
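
P.S. To make option 3 concrete, here is a rough Python sketch of the step I have in mind. It only builds the zkcli.sh command lines (clear, upconfig, linkconfig) and prints them; it does not run anything. The script location, the ZooKeeper path to clear, the config directories and the collection names are all placeholders, and I have not verified that linking a config before the collection exists behaves the way I hope.

```python
from typing import Dict, List

# Assumption: Solr's zkcli.sh script is reachable under this name.
ZKCLI = "zkcli.sh"


def recovery_commands(
    zkhost: str,
    zk_path: str,
    confdirs: Dict[str, str],
    collection_conf: Dict[str, str],
) -> List[List[str]]:
    """Build the zkcli.sh invocations for option 3.

    zkhost          -- ZooKeeper connection string (placeholder value in the example)
    zk_path         -- znode path to clear (placeholder; depends on the installation)
    confdirs        -- config set name -> local directory holding that config
    collection_conf -- collection name -> config set name it should be linked to
    """
    # Refuse to link a collection to a config set that would not be uploaded.
    unknown = sorted(set(collection_conf.values()) - set(confdirs))
    if unknown:
        raise ValueError("collections reference unknown config sets: %s" % unknown)

    base = [ZKCLI, "-zkhost", zkhost]

    # Step 1: clear the existing data under zk_path.
    cmds = [base + ["-cmd", "clear", zk_path]]

    # Step 2: upload every config set.
    for name, path in sorted(confdirs.items()):
        cmds.append(base + ["-cmd", "upconfig", "-confdir", path, "-confname", name])

    # Step 3: link every collection to its config set before the nodes start.
    for coll, conf in sorted(collection_conf.items()):
        cmds.append(base + ["-cmd", "linkconfig", "-collection", coll, "-confname", conf])

    return cmds


if __name__ == "__main__":
    # All values below are made up for illustration.
    example = recovery_commands(
        "zk1:2181",
        "/solr",
        {"confA": "/opt/configs/confA", "confB": "/opt/configs/confB"},
        {"collection1": "confA", "collection2": "confB"},
    )
    for cmd in example:
        print(" ".join(cmd))
```

The collection-to-config mapping would have to be kept somewhere outside ZooKeeper (for us, probably a small file next to the configs), which is the main extra cost of this option compared with option 1.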