According to what you describe, I really don't see the need for core discovery in Solr Cloud. It would only be used to eagerly load a core on startup. If I understand correctly, when ZK = truth, this eager loading can/should be done by consulting zookeeper instead of the local disk. I agree that it is really confusing. The best strategy that I can see is to stop relying on core.properties and keep it all in zookeeper.
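To make sure I understand the "ZK = truth" startup you describe, here is a rough sketch of the decision a node would make. This is not Solr code: the function name, the shape of the cluster state (a dict of node name to core names) and the core names are all made up for illustration.

```python
from typing import Dict, Iterable, List, Set


def reconcile_cores(zk_cores_by_node: Dict[str, Set[str]],
                    node_name: str,
                    local_core_dirs: Iterable[str]) -> Dict[str, List[str]]:
    """Decide what a node does at startup when ZK is the only source of truth.

    zk_cores_by_node is a hypothetical, simplified view of the cluster state:
    node name -> names of the cores that ZK says live on that node.
    """
    expected = zk_cores_by_node.get(node_name, set())
    local = set(local_core_dirs)
    return {
        # In ZK and on disk: load eagerly, as core discovery does today.
        "load": sorted(expected & local),
        # In ZK but missing on disk: recreate (addreplica, essentially).
        "create": sorted(expected - local),
        # On disk but unknown to ZK: ignore.
        "ignore": sorted(local - expected),
    }


# Jeff's example: ZK says node XYZ only has a replica of shard3 of collection1,
# the data directory was wiped, and a core of collection2 was copied in by hand.
plan = reconcile_cores(
    {"nodeXYZ": {"collection1_shard3_replica1"}},
    "nodeXYZ",
    ["collection2_shard1_replica1"],
)
print(plan)
```

In this picture core.properties plays no part in the decision, which is why I don't see what is left for core discovery to do.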
On Wed, Mar 2, 2016 at 7:54 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> Well, with the understanding that someone who isn’t involved in the
> process is describing something that isn’t built yet...
>
> I could imagine changes like:
> - Core discovery ignores cores that aren’t present in the ZK cluster state
> - New cores are automatically created to bring a node in line with ZK
>   cluster state (addreplica, essentially)
>
> So if the clusterstate said “node XYZ has a replica of shard3 of
> collection1 and that’s all”, and you downed node XYZ and deleted the data
> directory, it’d get restored when you started the node again. And if you
> copied the core directory for shard1 of collection2 in there and restarted
> the node, it’d get ignored because the clusterstate says node XYZ doesn’t
> have that.
>
> More importantly, if you completely destroyed a node and rebuilt it from
> an image, (AWS?) that image wouldn't need any special core directories
> specific to that node. As long as the node name was the same, Solr would
> handle bringing that node back to where it was in the cluster.
>
> Back to opinions, I think mixing the cluster definition between local disk
> on the nodes and ZK clusterstate is just confusing. It should really be one
> or the other. Specifically, I think it should be local disk for
> non-SolrCloud, and ZK for SolrCloud.
>
> On 3/2/16, 12:13 AM, "danny teichthal" <dannyt...@gmail.com> wrote:
>
> >Thanks Jeff,
> >I understand your philosophy and it sounds correct.
> >Since we had many problems with zookeeper when switching to Solr Cloud, we
> >couldn't make it a source of knowledge and had to rely on a more stable
> >source.
> >The issue is that when we got such a zookeeper event, it brought our
> >system down, and in this case, clearing the core.properties was a life
> >saver.
> >We've managed to make it pretty stable now, but we will always need a
> >"doomsday" weapon.
> >
> >I looked into the related JIRA and it confused me a little, and raised a
> >few other questions:
> >1. What exactly defines zookeeper as a truth?
> >2. What is the role of core.properties if the state is only in zookeeper?
> >
> >Your tool is very interesting, I just thought about writing such a tool
> >myself.
> >From the sources I understand that you represent each node as a path in the
> >git repository.
> >So, I guess that for restore purposes I will have to do
> >the opposite direction and create a node for every path entry.
> >
> >On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
> >
> >> I’ve been running SolrCloud clusters in various versions for a few years
> >> here, and I can only think of two or three cases that the ZK-stored cluster
> >> state was broken in a way that I had to manually intervene by hand-editing
> >> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
> >> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster has
> >> been pretty stable, and my collection count is smaller than yours)
> >>
> >> My philosophy is that ZK is the source of cluster configuration, not the
> >> collection of core.properties files on the nodes.
> >> Currently, cluster state is shared between ZK and core directories. I’d
> >> prefer, and I think Solr development is going this way, (SOLR-7269) that
> >> all cluster state exist and be managed via ZK, and all state be removed
> >> from the local disk of the cluster nodes. The fact that a node uses local
> >> disk based configuration to figure out what collections/replicas it has is
> >> something that should be fixed, in my opinion.
> >>
> >> If you’re frequently getting into bad states due to ZK issues, I’d suggest
> >> you file bugs against Solr for the fact that you got into the state, and
> >> then fix your ZK cluster.
> >>
> >> Failing that, can you just periodically back up your ZK data and restore
> >> it if something breaks? I wrote a little tool to watch clusterstate.json
> >> and write every version to a local git repo a few years ago. I was mostly
> >> interested because I wanted to see changes that happened pretty fast, but
> >> it could also serve as a backup approach. Here’s a link, although I clearly
> >> haven’t touched it lately. Feel free to ask if you have issues:
> >> https://github.com/randomstatistic/git_zk_monitor
> >>
> >> On 3/1/16, 12:09 PM, "danny teichthal" <dannyt...@gmail.com> wrote:
> >>
> >> >Hi,
> >> >Just summarizing my questions if the long mail is a little intimidating:
> >> >1. Is there a best practice/automated tool for overcoming problems in
> >> >cluster state coming from zookeeper disconnections?
> >> >2. Creating a collection via core admin is discouraged, is it true also for
> >> >core.properties discovery?
> >> >
> >> >I would like to be able to specify collection.configName in the
> >> >core.properties and when starting server, the collection will be created
> >> >and linked to the config name specified.
> >> >
> >> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal <dannyt...@gmail.com>
> >> >wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I would like to describe a process we use for overcoming problems in
> >> >> cluster state when we have networking issues. Would appreciate if anyone
> >> >> can answer about what are the flaws on this solution and what is the best
> >> >> practice for recovery in case of network problems involving zookeeper.
> >> >> I'm working with Solr Cloud with version 5.2.1
> >> >> ~100 collections in a cluster of 6 machines.
> >> >>
> >> >> This is the short procedure:
> >> >> 1. Bring all the cluster down.
> >> >> 2. Clear all data from zookeeper.
> >> >> 3. Upload configuration.
> >> >> 4. Restart the cluster.
> >> >>
> >> >> We rely on the fact that a collection is created on core discovery
> >> >> process, if it does not exist. It gives us much flexibility.
> >> >> When the cluster comes up, it reads from core.properties and creates the
> >> >> collections if needed.
> >> >> Since we have only one configuration, the collections are automatically
> >> >> linked to it and the cores inherit it from the collection.
> >> >> This is a very robust procedure, that helped us overcome many problems
> >> >> until we stabilized our cluster which is now pretty stable.
> >> >> I know that the leader might change in such case and may lose updates, but
> >> >> it is ok.
> >> >>
> >> >> The problem is that today I want to add a new config set.
> >> >> When I add it and clear zookeeper, the cores cannot be created because
> >> >> there are 2 configurations. This breaks my recovery procedure.
> >> >>
> >> >> I thought about a few options:
> >> >> 1. Put the config Name in core.properties - this doesn't work. (It is
> >> >> supported in CoreAdminHandler, but is discouraged according to
> >> >> documentation)
> >> >> 2. Change recovery procedure to not delete all data from zookeeper, but
> >> >> only relevant parts.
> >> >> 3. Change recovery procedure to delete all, but recreate and link
> >> >> configurations for all collections before startup.
> >> >>
> >> >> Option #1 is my favorite, because it is very simple, it is currently not
> >> >> supported, but from looking on code it looked like it is not complex to
> >> >> implement.
> >> >>
> >> >> My questions are:
> >> >> 1. Is there something wrong in the recovery procedure that I described?
> >> >> 2. What is the best way to fix problems in cluster state, except from
> >> >> editing clusterstate.json manually? Is there an automated tool for that? We
> >> >> have about 100 collections in a cluster, so editing is not really a
> >> >> solution.
> >> >> 3. Is creating a collection via core.properties also discouraged?
> >> >>
> >> >> Would very much appreciate any answers/thoughts on that.
> >> >>
> >> >> Thanks,