Hi Erick,

Thank you for your detailed reply.

You say in our case some docs didn't make it to the node, but that's not really true: the docs can be found on the corrupted nodes when I search on ID, and the docs are complete. The problem is that the docs do not appear when I filter on certain fields (even though those fields are present in the doc and have the right value when I search on ID). So something seems to be corrupt in the filter index. We will try CheckIndex; hopefully it is able to identify the problematic cores.
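To give a concrete (made-up) example of what we see, with a hypothetical field "status" and placeholder host/collection names:

http://host:8983/solr/ourcollection/select?q=id:12345
  -> returns the doc, and it contains status:active
http://host:8983/solr/ourcollection/select?q=*:*&fq=status:active
  -> the same doc is missing from the results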

I understand there is no "master" in SolrCloud. In our case we use haproxy as a load balancer for every request, so when indexing, consecutive documents are sent to different Solr servers, one immediately after the other. Maybe SolrCloud is not able to handle that correctly?


Thanks,

Martin




Erick Erickson wrote on 05.03.2015 19:00:

Wait up. There's no "master" index in SolrCloud. Raw documents are
forwarded to each replica, indexed and put in the local tlog. If a
replica falls too far out of synch (say you take it offline), then the
entire index _can_ be replicated from the leader and, if the leader's
index was incomplete then that might propagate the error.

The practical consequence of this is that if _any_ replica has a
complete index, you can recover. Before going there though, the
brute-force approach is to just re-index everything from scratch.
That's likely easier, especially on indexes this size.

Here's what I'd do.

Assuming you have the Collections API calls for ADDREPLICA and
DELETEREPLICA, then:
0> Identify the complete replicas. If you're lucky you have at least
one for each shard.
1> Copy 1 good index from each shard somewhere just to have a backup.
2> DELETEREPLICA on all the incomplete replicas
2.5> I might shut down all the nodes at this point and check that all
the cores I'd deleted were gone. If any remnants exist, 'rm -rf
deleted_core_dir'.
3> ADDREPLICA to add back the replicas you removed in step 2.

This should copy the entire index from the leader for each replica. As
you do this the leadership will change, and after you've deleted all
the incomplete replicas one of the complete ones will be the leader
and you should be OK.
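Roughly, the Collections API calls look like this (collection, shard and
replica names are placeholders; the replica parameter is the core_node
name you see in the cloud view of the admin UI):

curl "http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node2"
curl "http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=host2:8983_solr"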

If you don't want to/can't use the Collections API, then
0> Identify the complete replicas. If you're lucky you have at least
one for each shard.
1> Shut 'em all down.
2> Copy the good index somewhere just to have a backup.
3> 'rm -rf data' for all the incomplete cores.
4> Bring up the good cores.
5> Bring up the cores that you deleted the data dirs from.

What this should do is replicate the entire index from the leader. When
you restart the good cores (step 4 above), they'll _become_ the
leaders.
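A bare-bones sketch of that sequence, assuming cores live under
/var/solr/data/<core_name> (adjust paths and start/stop commands for
your install):

# with Solr stopped on all nodes:
cp -r /var/solr/data/good_core/data /backup/good_core_data   # step 2: keep a copy of a good index
rm -rf /var/solr/data/bad_core/data                          # step 3: wipe the incomplete core
# then start the nodes with the good cores first (step 4), and the wiped ones after (step 5)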

bq: Is it possible to make Solrcloud invulnerable for network problems
I'm a little surprised that this is happening. It sounds like the
network problems were such that some nodes weren't out of touch long
enough for Zookeeper to sense that they were down and put them into
recovery. Not sure there's any way to secure against that.

bq: Is it possible to see if a core is corrupt?
There's "CheckIndex", here's at least one link:
http://java.dzone.com/news/lucene-and-solrs-checkindex
What you're describing, though, is that docs just didn't make it to
the node, _not_ that the index has unexpected bits, bad disk sectors
and the like so CheckIndex can't detect that. How would it know what
_should_ have been in the index?
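If you do want to run it, CheckIndex is invoked directly against a
core's index directory using the lucene-core jar that ships with Solr,
something like (paths are guesses for your install):

java -cp /path/to/lucene-core-*.jar org.apache.lucene.index.CheckIndex /var/solr/data/mycore/data/index

It reports per-segment problems; the -fix option drops broken segments
entirely, so only use that on a copy or when you can re-index.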

bq: I noticed a difference in the "Gen" column on Overview -
Replication. Does this mean there is something wrong?
You cannot infer anything from this. In particular, the merging will
be significantly different between a single full-reindex and what the
state of segment merges is in an incrementally built index.

The admin UI screen is rooted in the pre-cloud days, the Master/Slave
thing is entirely misleading. In SolrCloud, since all the raw data is
forwarded to all replicas, and any auto commits that happen may very
well be slightly out of sync, the index size, number of segments,
generations, and all that are pretty safely ignored.

Best,
Erick

On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries <mar...@downnotifier.com>
wrote:

Hi Andrew,

Even our master index is corrupt, so I'm afraid this won't help in our case.

Martin

Andrew Butkus wrote on 05.03.2015 16:45:

Force a fetchindex on the slave from the master with this command:
http://slave_host:port/solr/replication?command=fetchindex - from
http://wiki.apache.org/solr/SolrReplication [1]. The above command
will download the whole index from master to slave. There are
configuration options in Solr that make this problem happen less often
(allowing it to recover from newly added documents and only send the
changes, with a wider gap) - but I can't remember what those were.
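On a multi-core install the core name goes in the URL path, something
like this (host, port and core name are placeholders):

curl "http://slave_host:8983/solr/core1/replication?command=fetchindex"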



Links:
------
[1] http://wiki.apache.org/solr/SolrReplication
