Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread danny teichthal
According to what you describe, I really don't see the need for core
discovery in SolrCloud. It would only be used to eagerly load a core on
startup.
If I understand correctly, when ZK = truth, this eager loading can/should
be done by consulting ZooKeeper instead of the local disk.
I agree that it is really confusing.
The best strategy I can see going forward is to stop relying on
core.properties and to keep it all in ZooKeeper.
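For example (an untested sketch, not something Solr does today; it assumes the
kazoo Python client and collections whose state lives in the shared
/clusterstate.json, while newer state formats keep it under
/collections/<name>/state.json instead, and the host/node names are made up),
a node could work out which cores it should host straight from ZooKeeper:

import json
from kazoo.client import KazooClient

def cores_for_node(zk_hosts, node_name):
    # Read the shared cluster state and collect every replica assigned to node_name.
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        data, _stat = zk.get("/clusterstate.json")
        state = json.loads(data.decode("utf-8"))
        cores = []
        for collection, coll_state in state.items():
            for shard_name, shard in coll_state.get("shards", {}).items():
                for replica_name, replica in shard.get("replicas", {}).items():
                    if replica.get("node_name") == node_name:
                        cores.append((collection, shard_name, replica.get("core")))
        return cores
    finally:
        zk.stop()

# Hypothetical usage:
# print(cores_for_node("zk1:2181,zk2:2181,zk3:2181", "solr-host-1:8983_solr"))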



Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread Jeff Wartes
Well, with the understanding that someone who isn’t involved in the process is 
describing something that isn’t built yet...

I could imagine changes like:
 - Core discovery ignores cores that aren’t present in the ZK cluster state
 - New cores are automatically created to bring a node in line with ZK cluster 
state (addreplica, essentially) 
 
So if the clusterstate said “node XYZ has a replica of shard3 of collection1 
and that’s all”, and you downed node XYZ and deleted the data directory, it’d 
get restored when you started the node again. And if you copied the core 
directory for shard1 of collection2 in there and restarted the node, it’d get 
ignored because the clusterstate says node XYZ doesn’t have that.

More importantly, if you completely destroyed a node and rebuilt it from an 
image, (AWS?) that image wouldn't need any special core directories specific to 
that node. As long as the node name was the same, Solr would handle bringing 
that node back to where it was in the cluster.
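Until something like that exists, I think the closest by-hand equivalent is the
Collections API ADDREPLICA call, which can target a specific node. A rough,
untested sketch (the host, collection, shard and node name are just examples):

import requests

def add_replica(solr_url, collection, shard, node_name):
    # Ask SolrCloud to create a new replica of the given shard on the given node.
    params = {
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "node": node_name,
        "wt": "json",
    }
    resp = requests.get(solr_url + "/admin/collections", params=params)
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage:
# add_replica("http://solr-host-1:8983/solr", "collection1", "shard3", "nodeXYZ:8983_solr")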

Back to opinions, I think mixing the cluster definition between local disk on 
the nodes and ZK clusterstate is just confusing. It should really be one or the 
other. Specifically, I think it should be local disk for non-SolrCloud, and ZK 
for SolrCloud.






Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread danny teichthal
Thanks Jeff,
I understand your philosophy and it sounds correct.
Since we had many problems with ZooKeeper when switching to SolrCloud, we
couldn't treat it as the source of truth and had to rely on a more stable
source.
The issue is that when we got such a ZooKeeper event it brought our
system down, and in that case clearing the core.properties files was a life
saver.
We've managed to make it pretty stable now, but we will always need a
"doomsday" weapon.

I looked into the related JIRA and it confused me a little, and it raised a
few other questions:
1. What exactly defines ZooKeeper as the truth?
2. What is the role of core.properties if the state is only in ZooKeeper?



Your tool is very interesting; I had just been thinking about writing such a
tool myself.
From the sources I understand that you represent each znode as a path in the
git repository.
So I guess that for restore purposes I will have to go in
the opposite direction and create a znode for every path entry.
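Something along these lines, maybe; an untested sketch with the kazoo Python
client, assuming a local dump directory whose relative file paths mirror the
znode paths (e.g. ./zk-dump/clusterstate.json maps to /clusterstate.json):

import os
from kazoo.client import KazooClient

def restore_from_dump(zk_hosts, dump_dir):
    # Recreate one znode per file in the dump, keeping the relative path as the znode path.
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        for root, _dirs, files in os.walk(dump_dir):
            for name in files:
                local_path = os.path.join(root, name)
                znode_path = "/" + os.path.relpath(local_path, dump_dir).replace(os.sep, "/")
                with open(local_path, "rb") as f:
                    payload = f.read()
                if zk.exists(znode_path):
                    zk.set(znode_path, payload)
                else:
                    zk.create(znode_path, payload, makepath=True)
    finally:
        zk.stop()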





Re: SolrCloud - Strategy for recovering cluster states

2016-03-01 Thread Jeff Wartes

I’ve been running SolrCloud clusters in various versions for a few years here,
and I can only think of two or three cases where the ZK-stored cluster state was
broken in a way that required me to intervene by hand-editing the contents
of ZK. I think I’ve seen Solr fixes go by for those cases, too. I’ve never
completely wiped ZK. (Although granted, my ZK cluster has been pretty stable,
and my collection count is smaller than yours.)

My philosophy is that ZK is the source of cluster configuration, not the
collection of core.properties files on the nodes.
Currently, cluster state is shared between ZK and the core directories. I’d prefer
(and I think Solr development is going this way; see SOLR-7269) that all cluster
state live in and be managed via ZK, and that all state be removed from the local
disk of the cluster nodes. The fact that a node uses local-disk-based configuration
to figure out which collections/replicas it has is something that should be
fixed, in my opinion.

If you’re frequently getting into bad states due to ZK issues, I’d suggest you 
file bugs against Solr for the fact that you got into the state, and then fix 
your ZK cluster.

Failing that, can you just periodically back up your ZK data and restore it if 
something breaks? I wrote a little tool to watch clusterstate.json and write 
every version to a local git repo a few years ago. I was mostly interested 
because I wanted to see changes that happened pretty fast, but it could also 
serve as a backup approach. Here’s a link, although I clearly haven’t touched 
it lately. Feel free to ask if you have issues: 
https://github.com/randomstatistic/git_zk_monitor
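The same idea also fits in a few lines of Python if that's easier to adapt;
here's an untested sketch with the kazoo client (the git commit step is left
out, and the ZK hosts and file naming are just examples):

import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

@zk.DataWatch("/clusterstate.json")
def on_change(data, stat):
    # Called on every change; keep a copy of each version on disk.
    if data is None:
        return
    fname = "clusterstate-v%d-%d.json" % (stat.version, int(time.time()))
    with open(fname, "wb") as f:
        f.write(data)

while True:
    time.sleep(60)  # the watch fires in a background thread; just stay alive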






Re: SolrCloud - Strategy for recovering cluster states

2016-03-01 Thread danny teichthal
Hi,
Just summarizing my questions in case the long mail is a little intimidating:
1. Is there a best practice/automated tool for overcoming problems in
cluster state caused by ZooKeeper disconnections?
2. Creating a collection via the core admin API is discouraged; is that also
true for core.properties discovery?

I would like to be able to specify collection.configName in
core.properties so that when the server starts, the collection is created
and linked to the specified config name.
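For illustration, this is roughly the core.properties I have in mind (property
names borrowed from the CoreAdmin CREATE parameters, values made up; whether
core discovery would honor collection.configName is exactly the open question):

name=collection1_shard1_replica1
collection=collection1
shard=shard1
collection.configName=myConfigSet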





SolrCloud - Strategy for recovering cluster states

2016-02-29 Thread danny teichthal
Hi,


I would like to describe a process we use for overcoming problems in
cluster state when we have networking issues. I would appreciate it if anyone
could point out the flaws in this solution and what the best practice is
for recovery in case of network problems involving ZooKeeper.
I'm working with SolrCloud version 5.2.1,
~100 collections in a cluster of 6 machines.

This is the short procedure (a sketch of steps 2 and 3 follows the list):
1. Bring the whole cluster down.
2. Clear all data from zookeeper.
3. Upload configuration.
4. Restart the cluster.
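A rough sketch of steps 2 and 3 with the kazoo Python client (untested; the ZK
hosts, local config directory and configset name are examples, and zkcli.sh's
upconfig command is the usual way to do the upload part):

import os
from kazoo.client import KazooClient

def clear_and_upload(zk_hosts, conf_dir, conf_name):
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        # Step 2: clear everything Solr stored in ZooKeeper (keep ZK's own /zookeeper node).
        for child in zk.get_children("/"):
            if child != "zookeeper":
                zk.delete("/" + child, recursive=True)
        # Step 3: upload a single configset under /configs/<conf_name>.
        for root, _dirs, files in os.walk(conf_dir):
            for name in files:
                local_path = os.path.join(root, name)
                rel = os.path.relpath(local_path, conf_dir).replace(os.sep, "/")
                with open(local_path, "rb") as f:
                    zk.create("/configs/%s/%s" % (conf_name, rel), f.read(), makepath=True)
    finally:
        zk.stop()

# Hypothetical usage:
# clear_and_upload("zk1:2181,zk2:2181,zk3:2181", "/opt/solr/myconf", "myconf")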

We rely on the fact that a collection is created during the core discovery
process if it does not already exist. This gives us a lot of flexibility.
When the cluster comes up, it reads core.properties and creates the
collections if needed.
Since we have only one configuration, the collections are automatically
linked to it and the cores inherit it from the collection.
This is a very robust procedure that helped us overcome many problems
until we stabilized our cluster, which is now pretty stable.
I know that the leader might change in such a case and we may lose updates,
but that is OK.


The problem is that today I want to add a new config set.
When I add it and clear ZooKeeper, the cores cannot be created because
there are now two configurations. This breaks my recovery procedure.

I thought about a few options:
1. Put the config name in core.properties - this doesn't work. (It is
supported in CoreAdminHandler, but is discouraged according to the
documentation.)
2. Change the recovery procedure so it does not delete all data from
ZooKeeper, only the relevant parts.
3. Change the recovery procedure to delete everything, but recreate and link
configurations for all collections before startup.

Option #1 is my favorite because it is very simple. It is currently not
supported, but from looking at the code it does not seem complex to
implement.



My questions are:
1. Is there something wrong with the recovery procedure that I described?
2. What is the best way to fix problems in cluster state, other than
editing clusterstate.json manually? Is there an automated tool for that? We
have about 100 collections in the cluster, so manual editing is not really a
solution.
3. Is creating a collection via core.properties also discouraged?



Would very much appreciate any answers/thoughts on this.


Thanks,