Hi,

What we're seeing is two replica's that are, for whatever reason, not 
completely in sync and unable to recover. While one node is in the `recovering 
state` we can fool it to be in active state by simply reloading the core on the 
machine on which it is in a bad state. Some arbitrary time later it will return 
to recovering.

Below is the communication between and the exceptions on the nodes in order:


Recovering machine starts the recovery process:

2012-11-01 08:55:13,533 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Starting Replication Recovery. core=shard_i
2012-11-01 08:55:13,534 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : 
Creating new http client, 
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-01 08:55:13,732 INFO [common.cloud.ZkStateReader] - [main-EventThread] 
- : A cluster state change has occurred - updating... (10)
2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : 
[shard_i] webapp=/solr path=/admin/ping params={} hits=1 status=0 QTime=10 
2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : 
[shard_i] webapp=/solr path=/admin/ping params={} status=0 QTime=10 
2012-11-01 08:55:14,700 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : 
[shard_i] webapp=/solr path=/admin/ping params={} hits=1 status=0 QTime=5 
2012-11-01 08:55:14,701 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : 
[shard_i] webapp=/solr path=/admin/ping params={} status=0 QTime=6 
2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Begin buffering updates. core=shard_i
2012-11-01 08:55:15,546 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Attempting to replicate from 
http://rot09.nl.idx.openindex.io:8080/solr/shard_i/. core=shard_i
2012-11-01 08:55:15,546 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : 
Creating new http client, 
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-01 08:55:15,557 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :  
No value set for 'pollInterval'. Timer Task not started.
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
Master's generation: 166
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
Slave's generation: 3
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
Starting replication process
2012-11-01 08:55:15,568 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
Number of files in latest index in master: 324
2012-11-01 08:55:15,569 WARN [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : No lockType configured for 
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@6f8288d2; 
maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple'
2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : return new directory for 
/opt/solr/cores/shard_i/data/index.20121101085515568 forceNew:false
2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data
2012-11-01 08:55:15,572 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
Starting download to 
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568
 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; 
maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true


Right now the leader is transmitting packets as we expect:


2012-11-01 08:55:15,567 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : 
[shard_i] webapp=/solr path=/replication 
params={command=filelist&generation=166&wt=javabin&q
t=/replication&version=2} status=0 QTime=4 
2012-11-01 08:55:15,576 INFO [solr.core.SolrCore] - [http-8080-exec-3] - : 
[shard_i] webapp=/solr path=/replication 
params={file=_2b9_nrm.cfs&command=filecontent&checks
um=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,583 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : 
[shard_i] webapp=/solr path=/replication 
params={file=_2t8_Lucene41_0.doc&command=filecontent
&checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,589 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : 
[shard_i] webapp=/solr path=/replication 
params={file=_2sc.tvd&command=filecontent&checksum=t
rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,598 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : 
[shard_i] webapp=/solr path=/replication 
params={file=_2qk.tvd&command=filecontent&checksum=t
rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,603 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : 
[shard_i] webapp=/solr path=/replication 
params={file=_2ii_Lucene41_0.pos&command=filecontent
&checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 


Until the leader cannot read the index and therefore not complete the upload:


2012-11-01 08:55:15,685 WARN [solr.handler.ReplicationHandler] - 
[http-8080-exec-6] - : Exception while writing response for params: 
file=_2ii_Lucene41_0.pos&command=filecontent&checksum=true&generation=166&qt=/replication&wt=filestream
java.io.EOFException: read past EOF: 
MMapIndexInput(path="/opt/solr/cores/shard_i/data/index.20121030152231942/_2ii_Lucene41_0.pos")
        at 
org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
        at 
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at 
org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at 
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at 
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-11-01 08:55:15,690 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : 
[shard_i] webapp=/solr path=/replication 
params={file=_2ii_Lucene41_0.pos&command=filecontent
&checksum=true&offset=1048576&generation=166&qt=/replication&wt=filestream} 
status=0 QTime=0 
2012-11-01 08:55:15,691 ERROR [solr.servlet.SolrDispatchFilter] - 
[http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
        at 
org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
        at 
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at 
org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at 
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at 
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-11-01 08:55:15,692 ERROR [solr.servlet.SolrDispatchFilter] - 
[http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
        at 
org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
        at 
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at 
org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at 
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at 
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


The recovering node isn't very happy and drops the replication and waits...


2012-11-01 08:55:15,686 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : 
Error in fetching packets 
java.io.EOFException
        at 
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
        at 
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
        at 
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
        at 
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
        at 
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
        at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
        at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
        at 
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
        at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-01 08:55:15,692 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index
2012-11-01 08:55:15,692 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : 
removing temporary index download directory files 
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568
 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2012-11-01 08:55:15,693 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing 
directory:/opt/solr/cores/shard_i/data/index.20121101085515568
2012-11-01 08:55:15,695 INFO [solr.core.CachingDirectoryFactory] - 
[RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index
2012-11-01 08:55:15,695 ERROR [solr.handler.ReplicationHandler] - 
[RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: 
Unable to download _2ii_Lucene41_0.pos completely. Downloaded 1048576!=1501300
        at 
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
        at 
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
        at 
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
        at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
        at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
        at 
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
        at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)

2012-11-01 08:55:15,695 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
- : Error while trying to recover:org.apache.solr.common.SolrException: 
Replication for recovery failed.
        at 
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
        at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)

2012-11-01 08:55:15,696 INFO [solr.update.UpdateLog] - [RecoveryThread] - : 
Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-01 08:55:15,696 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] 
- : Recovery failed - trying again... (8) core=shard_i
2012-11-01 08:55:15,696 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - 
: Wait 512.0 seconds before trying to recover again (9)


It seems to me the SolrCloud part is actually running and behaving as it should 
but the index file on the leader is corrupt, and that should not happen even on 
power loss. Any hints?


Thanks
Markus
 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wed 31-Oct-2012 14:14
> To: solr-user@lucene.apache.org; Markus Jelsma <markus.jel...@openindex.io>
> Subject: RE: trunk is unable to replicate between nodes ( Unable to download 
> ... completely)
> 
> Ah, we're also seeing Solr lookup an unexisting directory:
> 
> 2012-10-30 16:32:26,578 ERROR [handler.admin.CoreAdminHandler] - 
> [http-8080-exec-2] - : IO error while trying to get the size of the 
> Directory:org.apache.lucene.store.NoSuchDirectoryException: directory 
> '/opt/solr/cores/shard_a/data/index' does not exist
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:220)
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
>         at 
> org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
>         at 
> org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
> 
> Instead of data/index it should be looking for data/index.20121030152324761/, 
> which actually does exist.
> 
>  
>  
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Tue 30-Oct-2012 17:30
> > To: solr-user@lucene.apache.org
> > Subject: trunk is unable to replicate between nodes ( Unable to download 
> > ... completely)
> > 
> > Hi,
> > 
> > We're testing again with today's trunk and using the new Lucene 4.1 format 
> > by default. When nodes are not restarted things are kind of stable but 
> > restarting nodes leads to a lot of mayhem. It seems we can get the cluster 
> > back up and running by clearing ZK and restarting everything (another 
> > issue) but replication becomes impossible for some nodes leading to a 
> > continuous state of failing recovery etc.
> > 
> > Here are some excepts from the logs:
> > 
> > 2012-10-30 16:12:39,674 ERROR [solr.servlet.SolrDispatchFilter] - 
> > [http-8080-exe
> > c-5] - : null:java.lang.IndexOutOfBoundsException
> >         at java.nio.Buffer.checkBounds(Buffer.java:530)
> >         at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
> >         at 
> > org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferInde
> > xInput.java:91)
> >         at 
> > org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(
> > ReplicationHandler.java:1065)
> >         at 
> > org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> > 
> > 
> > 2012-10-30 16:10:32,220 ERROR [solr.handler.ReplicationHandler] - 
> > [RecoveryThrea
> > d] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to 
> > download
> >  _x.fdt completely. Downloaded 13631488!=13843504
> >         at 
> > org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapP
> > uller.java:1237)
> >         at 
> > org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(Sna
> > pPuller.java:1118)
> >         at 
> > org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java
> > :716)
> >         at 
> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
> >         at 
> > org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
> >         at 
> > org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
> >         at 
> > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
> > 
> > 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - 
> > [http-8080-exec
> > -3] - : Exception while writing response for params: 
> > file=_p_Lucene41_0.doc&comm
> > and=filecontent&checksum=true&generation=6&qt=/replication&wt=filestream
> > java.io.EOFException: read past EOF: 
> > MMapIndexInput(path="/opt/solr/cores/openindex_h/data/index.20121030152234973/_p_Lucene41_0.doc")
> >         at 
> > org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
> >         at 
> > org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
> >         at 
> > org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> > 
> > 
> > Needless to say i'm puzzled so i'm wondering if anyone has seen this before 
> > or have some hints that might help digg further.
> > 
> > Thanks,
> > Markus
> > 
> 

Reply via email to