Re: SolrCloud - Strategy for recovering cluster states

danny teichthal Wed, 02 Mar 2016 22:41:51 -0800

According to what you describe, I really don't see the need for core
discovery in Solr Cloud. It would only be used to eagerly load a core on
startup.
If I understand correctly, when ZK = truth, this eager loading can/should
be done by consulting zookeeper instead of local disk.
I agree that it is really confusing.
The best strategy that I see is to stop relying on core.properties and
keep it all in zookeeper.
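
To make the "consult zookeeper instead of local disk" idea concrete, here is
a minimal sketch in Python. It is only an illustration, not how Solr does or
will implement it. It assumes the cluster state has already been read from ZK
and parsed into a dict shaped like clusterstate.json (collection -> shards ->
replicas, each replica carrying "core" and "node_name"). The function names
cores_for_node and reconcile, the node name and the core names are all made
up for the example.

```python
def cores_for_node(clusterstate, node_name):
    """Map core name -> (collection, shard) for every replica that the
    cluster state assigns to node_name."""
    expected = {}
    for collection, coll_data in clusterstate.items():
        for shard, shard_data in coll_data.get("shards", {}).items():
            for replica in shard_data.get("replicas", {}).values():
                if replica.get("node_name") == node_name:
                    expected[replica["core"]] = (collection, shard)
    return expected


def reconcile(clusterstate, node_name, local_cores):
    """Compare what ZK says the node should host with the core directories
    found on local disk, and return a plan of what to do with each core."""
    expected = set(cores_for_node(clusterstate, node_name))
    local = set(local_cores)
    return {
        "load": sorted(local & expected),    # on disk and in ZK
        "create": sorted(expected - local),  # in ZK only: rebuild the replica
        "ignore": sorted(local - expected),  # on disk only: ZK does not know it
    }


# Hypothetical cluster state: node XYZ hosts one replica of collection1/shard3.
state = {
    "collection1": {
        "shards": {
            "shard3": {
                "replicas": {
                    "core_node1": {
                        "core": "collection1_shard3_replica1",
                        "node_name": "XYZ:8983_solr",
                    }
                }
            }
        }
    }
}

# The data directory was wiped and an unrelated core directory was copied in.
plan = reconcile(state, "XYZ:8983_solr", ["collection2_shard1_replica1"])
print(plan)
```

With this input the plan asks for collection1_shard3_replica1 to be created
and for the copied-in collection2 core to be ignored, which matches the
behaviour Jeff describes below.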



On Wed, Mar 2, 2016 at 7:54 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> Well, with the understanding that someone who isn’t involved in the
> process is describing something that isn’t built yet...
>
> I could imagine changes like:
>  - Core discovery ignores cores that aren’t present in the ZK cluster state
>  - New cores are automatically created to bring a node in line with ZK
> cluster state (addreplica, essentially)
>
> So if the clusterstate said “node XYZ has a replica of shard3 of
> collection1 and that’s all”, and you downed node XYZ and deleted the data
> directory, it’d get restored when you started the node again. And if you
> copied the core directory for shard1 of collection2 in there and restarted
> the node, it’d get ignored because the clusterstate says node XYZ doesn’t
> have that.
>
> More importantly, if you completely destroyed a node and rebuilt it from
> an image, (AWS?) that image wouldn't need any special core directories
> specific to that node. As long as the node name was the same, Solr would
> handle bringing that node back to where it was in the cluster.
>
> Back to opinions, I think mixing the cluster definition between local disk
> on the nodes and ZK clusterstate is just confusing. It should really be one
> or the other. Specifically, I think it should be local disk for
> non-SolrCloud, and ZK for SolrCloud.
>
>
>
>
>
> On 3/2/16, 12:13 AM, "danny teichthal" <dannyt...@gmail.com> wrote:
>
> >Thanks Jeff,
> >I understand your philosophy and it sounds correct.
> >We had many problems with zookeeper when switching to Solr Cloud, so we
> >couldn't treat it as a source of truth and had to rely on a more stable
> >source.
> >The issue is that when we got such a zookeeper event, it brought our
> >system down, and in those cases clearing the core.properties was a life
> >saver.
> >We've managed to make it pretty stable now, but we will always need a
> >"doomsday" weapon.
> >
> >I looked into the related JIRA and it confused me a little, and raised a
> >few other questions:
> >1. What exactly defines zookeeper as a truth?
> >2. What is the role of core.properties if the state is only in zookeeper?
> >
> >
> >
> >Your tool is very interesting, I just thought about writing such a tool
> >myself.
> >From the sources I understand that you represent each node as a path in
> >the git repository.
> >So, I guess that for restore purposes I will have to go in the opposite
> >direction and create a node for every path entry.
> >
> >
> >
> >
> >On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
> >
> >>
> >> I’ve been running SolrCloud clusters in various versions for a few years
> >> here, and I can only think of two or three cases that the ZK-stored
> >> cluster state was broken in a way that I had to manually intervene by
> >> hand-editing the contents of ZK. I think I’ve seen Solr fixes go by for
> >> those cases, too. I’ve never completely wiped ZK. (Although granted, my
> >> ZK cluster has been pretty stable, and my collection count is smaller
> >> than yours)
> >>
> >> My philosophy is that ZK is the source of cluster configuration, not the
> >> collection of core.properties files on the nodes.
> >> Currently, cluster state is shared between ZK and core directories. I’d
> >> prefer, and I think Solr development is going this way, (SOLR-7269) that
> >> all cluster state exist and be managed via ZK, and all state be removed
> >> from the local disk of the cluster nodes. The fact that a node uses
> >> local disk based configuration to figure out what collections/replicas
> >> it has is something that should be fixed, in my opinion.
> >>
> >> If you’re frequently getting into bad states due to ZK issues, I’d
> >> suggest you file bugs against Solr for the fact that you got into the
> >> state, and then fix your ZK cluster.
> >>
> >> Failing that, can you just periodically back up your ZK data and restore
> >> it if something breaks? I wrote a little tool to watch clusterstate.json
> >> and write every version to a local git repo a few years ago. I was
> >> mostly interested because I wanted to see changes that happened pretty
> >> fast, but it could also serve as a backup approach. Here’s a link,
> >> although I clearly haven’t touched it lately. Feel free to ask if you
> >> have issues:
> >> https://github.com/randomstatistic/git_zk_monitor
> >>
> >>
> >>
> >>
> >> On 3/1/16, 12:09 PM, "danny teichthal" <dannyt...@gmail.com> wrote:
> >>
> >> >Hi,
> >> >Just summarizing my questions in case the long mail is a little
> >> >intimidating:
> >> >1. Is there a best practice/automated tool for overcoming problems in
> >> >cluster state coming from zookeeper disconnections?
> >> >2. Creating a collection via core admin is discouraged; is that also
> >> >true for core.properties discovery?
> >> >
> >> >I would like to be able to specify collection.configName in
> >> >core.properties so that, when the server starts, the collection is
> >> >created and linked to the specified config name.
> >> >
> >> >
> >> >
> >> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal <dannyt...@gmail.com>
> >> >wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >>
> >> >> I would like to describe a process we use for overcoming problems in
> >> >> cluster state when we have networking issues. I would appreciate it
> >> >> if anyone could point out the flaws in this solution and the best
> >> >> practice for recovery in case of network problems involving
> >> >> zookeeper.
> >> >> I'm working with Solr Cloud version 5.2.1, with
> >> >> ~100 collections in a cluster of 6 machines.
> >> >>
> >> >> This is the short procedure:
> >> >> 1. Bring all the cluster down.
> >> >> 2. Clear all data from zookeeper.
> >> >> 3. Upload configuration.
> >> >> 4. Restart the cluster.
> >> >>
> >> >> We rely on the fact that a collection is created during the core
> >> >> discovery process if it does not exist. It gives us much flexibility.
> >> >> When the cluster comes up, it reads from core.properties and creates
> >> >> the collections if needed.
> >> >> Since we have only one configuration, the collections are
> >> >> automatically linked to it and the cores inherit it from the
> >> >> collection.
> >> >> This is a very robust procedure that helped us overcome many problems
> >> >> until we stabilized our cluster, which is now pretty stable.
> >> >> I know that the leader might change in such a case and updates may be
> >> >> lost, but that is ok.
> >> >>
> >> >>
> >> >> The problem is that today I want to add a new config set.
> >> >> When I add it and clear zookeeper, the cores cannot be created
> >> >> because there are 2 configurations. This breaks my recovery procedure.
> >> >>
> >> >> I thought about a few options:
> >> >> 1. Put the config name in core.properties - this doesn't work. (It is
> >> >> supported in CoreAdminHandler, but is discouraged according to the
> >> >> documentation)
> >> >> 2. Change the recovery procedure to not delete all data from
> >> >> zookeeper, but only the relevant parts.
> >> >> 3. Change the recovery procedure to delete all, but recreate and link
> >> >> configurations for all collections before startup.
> >> >>
> >> >> Option #1 is my favorite because it is very simple. It is currently
> >> >> not supported, but from looking at the code it does not look complex
> >> >> to implement.
> >> >>
> >> >>
> >> >>
> >> >> My questions are:
> >> >> 1. Is there something wrong in the recovery procedure that I
> >> >> described?
> >> >> 2. What is the best way to fix problems in cluster state, apart from
> >> >> editing clusterstate.json manually? Is there an automated tool for
> >> >> that? We have about 100 collections in a cluster, so editing is not
> >> >> really a solution.
> >> >> 3. Is creating a collection via core.properties also discouraged?
> >> >>
> >> >>
> >> >>
> >> >> I would very much appreciate any answers/thoughts on that.
> >> >>
> >> >>
> >> >> Thanks,
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >>
>
