Wait up. There's no "master" index in SolrCloud. Raw documents are forwarded to each replica, indexed, and written to the local tlog. If a replica falls too far out of sync (say you take it offline), then the entire index _can_ be replicated from the leader, and if the leader's index was incomplete, that might propagate the error.
The practical consequence of this is that if _any_ replica has a complete index, you can recover. Before going there, though, the brute-force approach is to just re-index everything from scratch. That's likely easier, especially on indexes this size.

Here's what I'd do. Assuming you have the Collections API calls ADDREPLICA and DELETEREPLICA available, then:

0> Identify the complete replicas. If you're lucky you have at least one for each shard.
1> Copy 1 good index from each shard somewhere, just to have a backup.
2> DELETEREPLICA on all the incomplete replicas.
2.5> I might shut down all the nodes at this point and check that all the cores I'd deleted were gone. If any remnants exist, 'rm -rf deleted_core_dir'.
3> ADDREPLICA to get the ones removed in <2> back.

<3> should copy the entire index from the leader for each replica. As you do <2> the leadership will change, and after you've deleted all the incomplete replicas, one of the complete ones will be the leader and you should be OK.

If you don't want to/can't use the Collections API, then:

0> Identify the complete replicas. If you're lucky you have at least one for each shard.
1> Shut 'em all down.
2> Copy the good index somewhere, just to have a backup.
3> 'rm -rf data' for all the incomplete cores.
4> Bring up the good cores.
5> Bring up the cores that you deleted the data dirs from.

What <5> should do is replicate the entire index from the leader. When you restart the good cores (step <4> above), they'll _become_ the leader.

bq: Is it possible to make Solrcloud invulnerable for network problems

I'm a little surprised that this is happening. It sounds like the network problems were such that some nodes weren't out of touch long enough for Zookeeper to sense that they were down and put them into recovery. Not sure there's any way to secure against that.

bq: Is it possible to see if a core is corrupt?
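The DELETEREPLICA/ADDREPLICA steps above are plain HTTP calls against the Collections API. A minimal sketch, printing the URLs rather than issuing them so you can eyeball them first; the host, collection, shard, and replica names (mycoll, shard1, core_node2) are placeholders for whatever your cluster actually has:

```shell
# Placeholders -- substitute your own host/collection/shard/replica names.
SOLR="http://localhost:8983/solr"

# 2> drop the incomplete replica (core_node2 is hypothetical; find the real
#    name in clusterstate.json or the admin UI cloud view)
echo "${SOLR}/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node2"

# 3> add it back; the new replica recovers its full index from the leader
echo "${SOLR}/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1"
```

Pipe each URL through curl (quoted, because of the ampersands) once you're happy with it, and repeat per incomplete replica.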
There's "CheckIndex", here's at least one link: http://java.dzone.com/news/lucene-and-solrs-checkindex What you're describing, though, is that docs just didn't make it to the node, _not_ that the index has unexpected bits, bad disk sectors and the like so CheckIndex can't detect that. How would it know what _should_ have been in the index? bq: I noticed a difference in the "Gen" column on Overview - Replication. Does this mean there is something wrong? You cannot infer anything from this. In particular, the merging will be significantly different between a single full-reindex and what the state of segment merges is in an incrementally built index. The admin UI screen is rooted in the pre-cloud days, the Master/Slave thing is entirely misleading. In SolrCloud, since all the raw data is forwarded to all replicas, and any auto commits that happen may very well be slightly out of sync, the index size, number of segments, generations, and all that are pretty safely ignored. Best, Erick On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries <mar...@downnotifier.com> wrote: > Hi Andrew, > > Even our master index is corrupt, so I'm afraid this won't help in our case. > > Martin > > > Andrew Butkus schreef op 05.03.2015 16:45: > > >> Force a fetchindex on slave from master command: >> http://slave_host:port/solr/replication?command=fetchindex - from >> http://wiki.apache.org/solr/SolrReplication >> >> The above command will download the whole index from master to slave, >> there are configuration options in solr to make this problem happen less >> often (allowing it to recover from new documents added and only send the >> changes with a wider gap) - but I cant remember what those were. > >