Wait up. There's no "master" index in SolrCloud. Raw documents are
forwarded to each replica, indexed, and written to the local tlog. If a
replica falls too far out of sync (say you take it offline), then the
entire index _can_ be replicated from the leader and, if the leader's
index was incomplete, that might propagate the error.

The practical consequence is that if _any_ replica has a complete
index, you can recover. Before going there, though, the brute-force
approach is to just re-index everything from scratch. That's likely
easier, especially on indexes of this size.


Here's what I'd do.

Assuming you have the Collections API calls ADDREPLICA and
DELETEREPLICA available, then:
0> Identify the complete replicas. If you're lucky you have at least
one for each shard.
1> Copy 1 good index from each shard somewhere just to have a backup.
2> DELETEREPLICA on all the incomplete replicas
2.5> I might shut down all the nodes at this point and check that all
the cores I'd deleted were gone. If any remnants exist, 'rm -rf
deleted_core_dir'.
3> ADDREPLICA to get the ones removed in <2> back.

<3> should copy the entire index from the leader for each replica. As
you do <2> the leadership will change and after you've deleted all the
incomplete replicas, one of the complete ones will be the leader and
you should be OK.
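For reference, the calls in <2> and <3> look roughly like this. The
collection, shard, replica, and node names below are placeholders
(pull your real ones from CLUSTERSTATUS), and I'm printing the curl
commands rather than executing them so nothing fires by accident:

```shell
# Hypothetical names: collection "coll", shard "shard1", replica
# "core_node2", target node "host2:8983_solr". Substitute your own.
SOLR="http://localhost:8983/solr"

# 2> Delete an incomplete replica:
echo "curl '$SOLR/admin/collections?action=DELETEREPLICA&collection=coll&shard=shard1&replica=core_node2'"

# 3> Add it back; the new replica pulls its full index from the leader:
echo "curl '$SOLR/admin/collections?action=ADDREPLICA&collection=coll&shard=shard1&node=host2:8983_solr'"
```

Do the DELETEREPLICAs for a shard before the matching ADDREPLICA, so
the leader at copy time is one of the complete replicas.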


If you don't want to/can't use the Collections API, then
0> Identify the complete replicas. If you're lucky you have at least
one for each shard.
1> Shut 'em all down.
2> Copy the good index somewhere just to have a backup.
3> 'rm -rf data' for all the incomplete cores.
4> Bring up the good cores.
5> Bring up the cores that you deleted the data dirs from.

What <5> should do is replicate the entire index from the leader. When
you restart the good cores (step 4 above), they'll _become_ the
leader.
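The manual sequence, as a shell sketch. The core layout here is
hypothetical and simulated in a temp dir so the commands are safe to
run as-is; on a real install you'd point at the cores under your
actual Solr home:

```shell
# Simulate a Solr home with one good core and one incomplete core.
# (Hypothetical names; real core dirs live under your install's home.)
SOLR_HOME=$(mktemp -d)
mkdir -p "$SOLR_HOME/coll_shard1_replica1/data/index" \
         "$SOLR_HOME/coll_shard1_replica2/data/index"

# 1> Nodes are already shut down at this point.
# 2> Copy the good index somewhere as a backup:
cp -r "$SOLR_HOME/coll_shard1_replica1/data" "$SOLR_HOME/replica1_backup"

# 3> 'rm -rf data' for each incomplete core:
rm -rf "$SOLR_HOME/coll_shard1_replica2/data"

# 4> / 5> Bring up the good cores first, then the emptied ones; the
# empty cores then replicate the entire index from the leader.
ls "$SOLR_HOME"
```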


bq: Is it possible to make Solrcloud invulnerable for network problems
I'm a little surprised that this is happening. It sounds like the
network problems were such that some nodes weren't out of touch long
enough for ZooKeeper to sense that they were down and put them into
recovery. I'm not sure there's any way to guard against that.

bq: Is it possible to see if a core is corrupt?
There's "CheckIndex", here's at least one link:
http://java.dzone.com/news/lucene-and-solrs-checkindex
What you're describing, though, is that docs just didn't make it to
the node, _not_ that the index has unexpected bits, bad disk sectors
and the like so CheckIndex can't detect that. How would it know what
_should_ have been in the index?
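For completeness, CheckIndex is run against a closed index directory,
roughly like this. The jar and index paths are assumptions for a
typical install, and the command is printed rather than executed since
it needs a real index on disk:

```shell
# Hypothetical paths: point -cp at the lucene-core jar shipped with
# your Solr, and the final argument at one core's index directory.
# Run it only while that core is offline.
LUCENE_JAR="server/solr-webapp/webapp/WEB-INF/lib/lucene-core.jar"
INDEX_DIR="/var/solr/data/coll_shard1_replica1/data/index"
echo "java -cp $LUCENE_JAR org.apache.lucene.index.CheckIndex $INDEX_DIR"
```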

bq:  I noticed a difference in the "Gen" column on Overview -
Replication. Does this mean there is something wrong?
You cannot infer anything from this. In particular, the state of
segment merges will differ significantly between a single full
re-index and an incrementally built index.

The admin UI screen is rooted in the pre-cloud days; the Master/Slave
labels are entirely misleading. In SolrCloud, since all the raw data
is forwarded to all replicas and any autocommits may well fire
slightly out of sync, the index size, number of segments, generations,
and all that can pretty safely be ignored.

Best,
Erick

On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries <mar...@downnotifier.com> wrote:
> Hi Andrew,
>
> Even our master index is corrupt, so I'm afraid this won't help in our case.
>
> Martin
>
>
> Andrew Butkus schreef op 05.03.2015 16:45:
>
>
>> Force a fetchindex on slave from master command:
>> http://slave_host:port/solr/replication?command=fetchindex - from
>> http://wiki.apache.org/solr/SolrReplication
>>
>> The above command will download the whole index from master to slave,
>> there are configuration options in solr to make this problem happen less
>> often (allowing it to recover from new documents added and only send the
>> changes with a wider gap) - but I cant remember what those were.
>
>
