Hi,

I would like to describe a process we use to recover from cluster state
problems caused by networking issues. I would appreciate it if anyone could
point out the flaws in this solution and describe the best practice for
recovery from network problems involving ZooKeeper.
I'm working with SolrCloud 5.2.1, with ~100 collections in a cluster of 6
machines.

This is the short procedure:
1. Bring the whole cluster down.
2. Clear all data from ZooKeeper.
3. Upload configuration.
4. Restart the cluster.
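
For reference, the four steps can be sketched roughly as follows. This is
only an illustration, not our actual script: the host name, paths and config
name are made up, the zkcli.sh location assumes a standard Solr 5.x install
layout, and the flags should be checked against the installed version. The
path to clear is assumed to be "/" (no chroot). The function only builds the
command lines; steps 1 and 4 would have to be run on every node.

```python
from typing import List

# Assumed location of zkcli.sh, relative to the Solr install directory.
ZKCLI = "server/scripts/cloud-scripts/zkcli.sh"


def recovery_commands(zk_host: str, conf_dir: str, conf_name: str,
                      zk_path: str = "/") -> List[List[str]]:
    """Return the four recovery steps, in order, as argument lists."""
    return [
        # 1. Bring the node down (repeat on every machine in the cluster).
        ["bin/solr", "stop", "-all"],
        # 2. Clear all data from ZooKeeper (zk_path is "/" unless a chroot is used).
        [ZKCLI, "-zkhost", zk_host, "-cmd", "clear", zk_path],
        # 3. Upload the single configuration.
        [ZKCLI, "-zkhost", zk_host, "-cmd", "upconfig",
         "-confdir", conf_dir, "-confname", conf_name],
        # 4. Restart the node in cloud mode (repeat on every machine).
        ["bin/solr", "start", "-c", "-z", zk_host],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in recovery_commands("zk1:2181", "/opt/solr/myconf/conf", "myconf"):
        print(" ".join(cmd))
```

Nothing is executed here; each argument list could be passed to
subprocess.run once the commands have been reviewed.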

We rely on the fact that the core discovery process creates a collection
if it does not already exist. This gives us a lot of flexibility.
When the cluster comes up, it reads each core.properties file and creates
the collections if needed.
Since we have only one configuration, the collections are automatically
linked to it and the cores inherit it from the collection.
This is a very robust procedure that helped us overcome many problems
until we stabilized our cluster, which is now pretty stable.
I know that the leader might change in this case and that updates may be
lost, but that is OK for us.


The problem is that I now want to add a new config set.
When I add it and clear ZooKeeper, the cores cannot be created because
there are 2 configurations, so the collections can no longer be linked
automatically. This breaks my recovery procedure.

I thought about a few options:
1. Put the config name in core.properties - this doesn't work. (It is
supported in CoreAdminHandler, but is discouraged according to the
documentation.)
2. Change the recovery procedure to not delete all data from ZooKeeper, but
only the relevant parts.
3. Change the recovery procedure to delete everything, but recreate and link
the configurations for all collections before startup.
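
To make option 3 more concrete, here is a rough sketch of what I have in
mind. It scans the Solr home for core.properties files, collects the
"collection" values, and builds one zkcli.sh linkconfig command per
collection. The mapping from collection to config name (config_for), the
default config name and the zkcli.sh path are hypothetical, and the sketch
assumes that linkconfig can be run before the collection exists in
ZooKeeper. The properties parsing is simplified and only handles plain
key=value lines.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Set

# Assumed location of zkcli.sh, relative to the Solr install directory.
ZKCLI = "server/scripts/cloud-scripts/zkcli.sh"


def read_collection(props_file: Path) -> str:
    """Return the 'collection' value of a core.properties file, or '' if absent."""
    for line in props_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "collection":
            return value.strip()
    return ""


def discover_collections(solr_home: Path) -> Set[str]:
    """Collect the collection names of all cores found under solr_home."""
    names = (read_collection(p) for p in solr_home.rglob("core.properties"))
    return {name for name in names if name}


def link_commands(zk_host: str, collections: Iterable[str],
                  config_for: Dict[str, str],
                  default_config: str) -> List[List[str]]:
    """Build one linkconfig command per collection, sorted by collection name."""
    return [
        [ZKCLI, "-zkhost", zk_host, "-cmd", "linkconfig",
         "-collection", name,
         "-confname", config_for.get(name, default_config)]
        for name in sorted(collections)
    ]


if __name__ == "__main__":
    # Demo against a throwaway directory with two made-up cores.
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp)
        for core, coll in [("a_shard1_replica1", "a"), ("b_shard1_replica1", "b")]:
            (home / core).mkdir()
            (home / core / "core.properties").write_text(
                f"name={core}\ncollection={coll}\n")
        found = discover_collections(home)
        for cmd in link_commands("zk1:2181", found, {"b": "new_conf"}, "old_conf"):
            print(" ".join(cmd))
```

The commands would be run after uploading both config sets and before
starting the nodes, so that every collection is already linked when core
discovery kicks in.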

Option #1 is my favorite because it is very simple. It is currently not
supported, but from looking at the code it does not seem complex to
implement.



My questions are:
1. Is there something wrong with the recovery procedure that I described?
2. What is the best way to fix problems in cluster state, other than
editing clusterstate.json manually? Is there an automated tool for that? We
have about 100 collections in a cluster, so manual editing is not really a
solution.
3. Is creating a collection via core.properties also discouraged?



I would very much appreciate any answers or thoughts on this.


Thanks,
