Solr replication hangs on multiple slave nodes

Justin Babuscio Thu, 04 Oct 2012 09:06:32 -0700

After a large index rebuild (16 masters with ~15 GB each), some slaves fail
to replicate completely.


We are running Solr v3.5 with 16 masters, each with 2 slaves, for a total of
48 servers.

4 of the 32 slaves sit in a stalled replication state with similar messages:

Files Downloaded:  254/260
Downloaded: 12.09 GB / 12.09 GB [ 100% ]
Downloading File: _t6.fdt, Downloaded: 3.1 MB / 3.1 MB [ 100 % ]
Time Elapsed: 3215s, Estimated Time Remaining: 0s, Speed: 24.5 MB/s


As you'll notice, the byte counts show the download as complete, but only
254 of the 260 files are reported as downloaded.  This state also prevents
the slaves from polling the masters for new updates.  When searching, we
occasionally see 500 responses from the slaves that failed to replicate.
The errors are:

ArrayIndexOutOfBounds - this occurs when writing the HTTP Response (our
container is WebSphere)
NullPointerExceptions - org.apache.lucene.queryParser.QueryParser.parse
(QueryParser.java:203)

We have tried stopping the slave, deleting the /data directory, and
restarting.  The slave started downloading the index again but stalled in
the same way.
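
For anyone who wants to check many slaves for the same symptom, here is a
minimal sketch in Python that flags the condition described above: byte
counters equal, file counters not. It only parses the status text in the
form pasted in this message; the function names and the dictionary keys are
my own and are not part of Solr, and the text layout may differ in other
versions.

```python
import re

# Status text copied from the stalled slave described in this message.
SAMPLE = """Files Downloaded:  254/260
Downloaded: 12.09 GB / 12.09 GB [ 100% ]
Downloading File: _t6.fdt, Downloaded: 3.1 MB / 3.1 MB [ 100 % ]
Time Elapsed: 3215s, Estimated Time Remaining: 0s, Speed: 24.5 MB/s
"""


def parse_replication_status(text):
    """Extract the file and byte counters from the status text (hypothetical helper)."""
    files = re.search(r"Files Downloaded:\s*(\d+)\s*/\s*(\d+)", text)
    # Anchor at line start so the per-file "Downloading File: ..." line is skipped.
    size = re.search(
        r"^[ \t]*Downloaded:\s*([\d.]+)\s*(\w+)\s*/\s*([\d.]+)\s*(\w+)",
        text,
        re.MULTILINE,
    )
    if not files or not size:
        raise ValueError("unrecognized status text")
    return {
        "files_done": int(files.group(1)),
        "files_total": int(files.group(2)),
        "bytes_done": (float(size.group(1)), size.group(2)),
        "bytes_total": (float(size.group(3)), size.group(4)),
    }


def looks_stalled(status):
    """True when all bytes are reported as downloaded but some files are not."""
    return (
        status["bytes_done"] == status["bytes_total"]
        and status["files_done"] < status["files_total"]
    )


if __name__ == "__main__":
    status = parse_replication_status(SAMPLE)
    print(status["files_done"], status["files_total"], looks_stalled(status))
```

The check deliberately ignores elapsed time and speed, since on the stalled
slaves those fields look normal and only the file count gives the problem
away.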

Thanks,
Justin
