Hi, What we're seeing is two replica's that are, for whatever reason, not completely in sync and unable to recover. While one node is in the `recovering state` we can fool it to be in active state by simply reloading the core on the machine on which it is in a bad state. Some arbitrary time later it will return to recovering.
Below is the communication between and the exceptions on the nodes in order: Recovering machine starts the recovery process: 2012-11-01 08:55:13,533 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting Replication Recovery. core=shard_i 2012-11-01 08:55:13,534 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2012-11-01 08:55:13,732 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10) 2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/admin/ping params={} hits=1 status=0 QTime=10 2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/admin/ping params={} status=0 QTime=10 2012-11-01 08:55:14,700 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr path=/admin/ping params={} hits=1 status=0 QTime=5 2012-11-01 08:55:14,701 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr path=/admin/ping params={} status=0 QTime=6 2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_i 2012-11-01 08:55:15,546 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null} 2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://rot09.nl.idx.openindex.io:8080/solr/shard_i/. core=shard_i 2012-11-01 08:55:15,546 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2012-11-01 08:55:15,557 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : No value set for 'pollInterval'. Timer Task not started. 2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation: 166 2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation: 3 2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication process 2012-11-01 08:55:15,568 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files in latest index in master: 324 2012-11-01 08:55:15,569 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568 lockFactory=org.apache.lucene.store.NativeFSLockFactory@6f8288d2; maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple' 2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return new directory for /opt/solr/cores/shard_i/data/index.20121101085515568 forceNew:false 2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data 2012-11-01 08:55:15,572 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true Right now the leader is transmitting packets as we expect: 2012-11-01 08:55:15,567 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/replication params={command=filelist&generation=166&wt=javabin&q t=/replication&version=2} status=0 QTime=4 2012-11-01 08:55:15,576 INFO [solr.core.SolrCore] - [http-8080-exec-3] - : [shard_i] webapp=/solr path=/replication params={file=_2b9_nrm.cfs&command=filecontent&checks um=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-01 08:55:15,583 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr path=/replication params={file=_2t8_Lucene41_0.doc&command=filecontent &checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-01 08:55:15,589 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_i] webapp=/solr path=/replication params={file=_2sc.tvd&command=filecontent&checksum=t rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-01 08:55:15,598 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_i] webapp=/solr path=/replication params={file=_2qk.tvd&command=filecontent&checksum=t rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-01 08:55:15,603 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_i] webapp=/solr path=/replication params={file=_2ii_Lucene41_0.pos&command=filecontent &checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 Until the leader cannot read the index and therefore not complete the upload: 2012-11-01 08:55:15,685 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Exception while writing response for params: file=_2ii_Lucene41_0.pos&command=filecontent&checksum=true&generation=166&qt=/replication&wt=filestream java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/shard_i/data/index.20121030152231942/_2ii_Lucene41_0.pos") at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100) at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065) at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889) at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2012-11-01 08:55:15,690 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/replication params={file=_2ii_Lucene41_0.pos&command=filecontent &checksum=true&offset=1048576&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-01 08:55:15,691 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException at java.nio.Buffer.checkBounds(Buffer.java:530) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218) at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91) at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065) at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889) at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2012-11-01 08:55:15,692 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException at java.nio.Buffer.checkBounds(Buffer.java:530) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218) at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91) at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065) at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889) at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) The recovering node isn't very happy and drops the replication and waits... 2012-11-01 08:55:15,686 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching packets java.io.EOFException at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151) at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-01 08:55:15,692 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index 2012-11-01 08:55:15,692 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; maxCacheMB=48.0 maxMergeSizeMB=4.0) 2012-11-01 08:55:15,693 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index.20121101085515568 2012-11-01 08:55:15,695 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index 2012-11-01 08:55:15,695 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _2ii_Lucene41_0.pos completely. Downloaded 1048576!=1501300 at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-01 08:55:15,695 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed. at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-01 08:55:15,696 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null} 2012-11-01 08:55:15,696 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (8) core=shard_i 2012-11-01 08:55:15,696 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 512.0 seconds before trying to recover again (9) It seems to me the SolrCloud part is actually running and behaving as it should but the index file on the leader is corrupt, and that should not happen even on power loss. Any hints? Thanks Markus -----Original message----- > From:Markus Jelsma <markus.jel...@openindex.io> > Sent: Wed 31-Oct-2012 14:14 > To: solr-user@lucene.apache.org; Markus Jelsma <markus.jel...@openindex.io> > Subject: RE: trunk is unable to replicate between nodes ( Unable to download > ... completely) > > Ah, we're also seeing Solr lookup an unexisting directory: > > 2012-10-30 16:32:26,578 ERROR [handler.admin.CoreAdminHandler] - > [http-8080-exec-2] - : IO error while trying to get the size of the > Directory:org.apache.lucene.store.NoSuchDirectoryException: directory > '/opt/solr/cores/shard_a/data/index' does not exist > at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:220) > at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243) > at > org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132) > at > org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146) > > Instead of data/index it should be looking for data/index.20121030152324761/, > which actually does exist. > > > > -----Original message----- > > From:Markus Jelsma <markus.jel...@openindex.io> > > Sent: Tue 30-Oct-2012 17:30 > > To: solr-user@lucene.apache.org > > Subject: trunk is unable to replicate between nodes ( Unable to download > > ... completely) > > > > Hi, > > > > We're testing again with today's trunk and using the new Lucene 4.1 format > > by default. When nodes are not restarted things are kind of stable but > > restarting nodes leads to a lot of mayhem. It seems we can get the cluster > > back up and running by clearing ZK and restarting everything (another > > issue) but replication becomes impossible for some nodes leading to a > > continuous state of failing recovery etc. > > > > Here are some excepts from the logs: > > > > 2012-10-30 16:12:39,674 ERROR [solr.servlet.SolrDispatchFilter] - > > [http-8080-exe > > c-5] - : null:java.lang.IndexOutOfBoundsException > > at java.nio.Buffer.checkBounds(Buffer.java:530) > > at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218) > > at > > org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferInde > > xInput.java:91) > > at > > org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write( > > ReplicationHandler.java:1065) > > at > > org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) > > > > > > 2012-10-30 16:10:32,220 ERROR [solr.handler.ReplicationHandler] - > > [RecoveryThrea > > d] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to > > download > > _x.fdt completely. Downloaded 13631488!=13843504 > > at > > org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapP > > uller.java:1237) > > at > > org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(Sna > > pPuller.java:1118) > > at > > org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java > > :716) > > at > > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) > > at > > org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) > > at > > org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) > > at > > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) > > > > 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - > > [http-8080-exec > > -3] - : Exception while writing response for params: > > file=_p_Lucene41_0.doc&comm > > and=filecontent&checksum=true&generation=6&qt=/replication&wt=filestream > > java.io.EOFException: read past EOF: > > MMapIndexInput(path="/opt/solr/cores/openindex_h/data/index.20121030152234973/_p_Lucene41_0.doc") > > at > > org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100) > > at > > org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065) > > at > > org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) > > > > > > Needless to say i'm puzzled so i'm wondering if anyone has seen this before > > or have some hints that might help digg further. > > > > Thanks, > > Markus > > >