On 10/8/07, Daniel Alheiros <[EMAIL PROTECTED]> wrote:
> Well I believe I can live with some staleness at certain moments, but it's
> not good as users are supposed to need it 24x7. So the common practice is to
> make one of the slaves the new master and switch things over to it, and
> after the outage put them in sync again and do the proper switch back? OK,
> I'll follow this, but I'm still concerned about the amount of manual steps
> to be done...
That was the plan - never needed it, though (never had a master completely
die that I know of). Having the collection not be updated for an hour or
so while the ops folks fixed things always worked fine.

> And other important issue is
> how frequently have you seen indexes getting corrupted?

Just once, I think - no idea of the cause (and I think it was quite an old
version of Lucene).

> If I try to run a
> commit or optimize on a Solr master instance and its index got corrupted,
> will it run the command?

Almost all of the cases I've seen of a master failing were OOM errors,
often during segment merging (again, older versions of Lucene, and someone
forgot to change the JVM heap size from the default). This could cause a
situation where you added a document but the old one was not deleted
(overwritten). Not "corrupted" at the Lucene level, but if the JVM died at
the wrong spot, search results could possibly return two documents for the
same unique key. We normally just rebuilt after a crash.

> And more importantly, will it run the
> postOptimize/postCommit scripts generating snapshots and then possibly
> propagating the bad index?

Normally not, I think... the JVM crash/restart left the Lucene write lock
acquired on the index and further attempts to modify it failed.

-Yonik
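[Editor's note: one small piece of the manual switch-over discussed above is deciding which slave to promote; a reasonable rule is to pick the slave holding the most recent snapshot. A minimal sketch, assuming the Solr 1.x rsync-based distribution scripts, where snapshots are directories named `snapshot.yyyymmddHHMMSS` so lexicographic order matches time order. The helper name `latest_snapshot` is hypothetical, not part of the shipped scripts.]

```shell
# Pick the newest snapshot from a list of snapshot names.
# Snapshot names embed a timestamp (snapshot.yyyymmddHHMMSS),
# so a plain lexicographic sort puts the newest one last.
latest_snapshot() {
  printf '%s\n' "$@" | sort | tail -n 1
}

# Example: the slave holding this snapshot is the best promotion candidate.
latest_snapshot snapshot.20071008093000 snapshot.20071008120000
# → snapshot.20071008120000
```

After picking the candidate, the remaining manual steps (repointing the indexer and the other slaves' snappullers at the new master) are site-specific and not shown here.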
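[Editor's note: the write-lock behavior mentioned above can be checked before restarting an indexer. A minimal sketch, assuming an older Lucene configuration where the lock is a `write.lock` file kept in the index directory itself (lock location varied across Lucene versions, so treat the path as an assumption); the function name `check_lock` is hypothetical.]

```shell
# Report whether a Lucene write lock file is present in an index directory.
# After a JVM crash this file can be left behind, making further
# modification attempts fail until it is cleaned up.
check_lock() {
  if [ -e "$1/write.lock" ]; then
    echo "stale lock present"
  else
    echo "no lock"
  fi
}

# Example usage (path is hypothetical):
# check_lock /var/solr/data/index
```

Only remove a leftover lock after confirming no live process still holds the index open.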