After a large index rebuild (16 masters, ~15 GB each), some slaves fail to replicate completely.
We are running Solr v3.5 with 16 masters and 2 slaves each, for a total of 48 servers. 4 of the 32 slaves sit in a stalled replication state with similar messages:

Files Downloaded: 254/260
Downloaded: 12.09 GB / 12.09 GB [ 100% ]
Downloading File: _t6.fdt, Downloaded: 3.1 MB / 3.1 MB [ 100% ]
Time Elapsed: 3215s, Estimated Time Remaining: 0s, Speed: 24.5 MB/s

As you'll notice, the download sizes all appear to be complete, but the file count is not (254 of 260). The stalled state also prevents these servers from polling the masters for new updates.

When searching, we occasionally see 500 responses from the slaves that failed to replicate. The errors are:

ArrayIndexOutOfBounds - this occurs when writing the HTTP response (our container is WebSphere)
NullPointerExceptions - org.apache.lucene.queryParser.QueryParser.parse (QueryParser.java:203)

We have tried stopping the slave, deleting the /data directory, and restarting. The slave started downloading the index again, but the replication stalled as before.

Thanks,
Justin
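
P.S. In case it helps anyone check their own slaves for the same symptom, here is a minimal sketch of the condition described above: the byte counters report complete while the file count is still short. It only parses status text in the format quoted in this post. The function name `looks_stalled` and the regular expressions are my own illustration, not part of Solr, and how you obtain the status text from each slave is left out.

```python
import re

# Patterns match the status text as quoted in this post (an assumption:
# other Solr versions may format it differently).
FILES_RE = re.compile(r"Files Downloaded:\s*(\d+)\s*/\s*(\d+)")
SIZE_RE = re.compile(
    r"Downloaded:\s*([\d.]+)\s*([KMGT]?B)\s*/\s*([\d.]+)\s*([KMGT]?B)"
)


def looks_stalled(status):
    """Heuristic: bytes report complete, but not all files have arrived."""
    files = FILES_RE.search(status)
    size = SIZE_RE.search(status)
    if not files or not size:
        return False  # unrecognized format; do not guess
    files_done, files_total = int(files.group(1)), int(files.group(2))
    # Compare the downloaded amount and unit with the total amount and unit.
    bytes_complete = size.group(1, 2) == size.group(3, 4)
    return bytes_complete and files_done < files_total


sample = (
    "Files Downloaded: 254/260 Downloaded: 12.09 GB / 12.09 GB [ 100% ] "
    "Downloading File: _t6.fdt, Downloaded: 3.1 MB / 3.1 MB [ 100 % ] "
    "Time Elapsed: 3215s, Estimated Time Remaining: 0s, Speed: 24.5 MB/s"
)
print(looks_stalled(sample))
```

This is only a heuristic: a slave that is about to finish its last few small files could briefly show the same numbers, so it makes sense to flag a slave only if the status does not change between two checks.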