I'll give you the high level before delving deep into setup etc. I have been 
struggeling at work with a seemingly random problem when solr will hang for 
10-15 minutes during updates.  This outage always seems to immediately be 
proceeded by an EOF exception on  the replica.  Then 10-15 minutes later we see 
an exception on the leader for a socket timeout to the replica.  The leader 
will then tell the replica to recover which in most cases it does and then the 
outage is over.

Here are the setup details:

We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines. 
We have 2 active collections each with only 1 shard (we have in total about 15 
collections but most are empty or have less than 100 docs). The first index 
(collection1) is 6.5GB and has ~18M documents.  The 2nd index (collection2) is 
9GB and has about 13M documents. In all cases the leader resides on 1 server 
and the replica resides on the other.  Both servers are AWS XL High Mem 
instances. (8 CPUs @ 2.67Ghz, 70GB Ram) with the index residing on a 1TB raid 
10 using ephemeral storage disks.  We are starting solr using the embedded 
jetty with the following java memory and GC options:

-Xmx16382m -Xms4092m -XX:MaxPermSize=256m -Xss256k -XX:NewSize=1536m 
-XX:SurvivorRatio=16 -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC 
-XX:ParallelCMSThreads=2 -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 
-XX:+CMSParallelRemarkEnabled

Both collections receive a constant stream of updates ~10k per hour (both 
adds/deletes).  Approximately once per day the following events transpire:


 1.  We see a log entry for a distributed update that takes just over 50000 ms 
followed by an EOF write exception on the replica. In all cases this exception 
is triggered by an update to the 9GB collection.
 2.  Occasionally we'll see a 503 shard update error on the leader but usually 
not.
 3.  Approximately 15 minutes after this exception we see a timeout error for a 
this distributed update request on the leader.
 4.  The leader then creates a new connection and tells the replica to recover, 
which it does and everything is OK again.
 5.  During the 15 minute window from when the replica throws the EOF until the 
SocketTimeout by the leader no other updates are processed:

ERROR ON REPLICA:

Jul 8, 2013 6:38:16 PM org.apache.solr.core.SolrCore execute
INFO: [collection2_0] webapp=/solr path=/update 
params={distrib.from=http://Solr4-1-1.domain.com:8983/solr/collection2_0/&update.distrib=FROMLEADER&wt=javabin&version=2}
 status=0 QTime=50012

Jul 8, 2013 6:38:16 PM org.apache.solr.common.SolrException log
SEVERE: null:org.eclipse.jetty.io.EofException
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:154)
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:101)
        at 
org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:203)
        at 
org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:196)
        at 
org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:94)
        at 
org.apache.solr.response.BinaryResponseWriter.write(BinaryResponseWriter.java:49)
        at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:404)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
        at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:952)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:662)

ERROR LEADER:

================ 10 MINUTES AFTER EOF (sometimes) ================

Jul 8, 2013 6:47:32 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: 
http://Solr4-1-2.domain.com:8983/solr/collection2_0/

Jul 8, 2013 6:47:32 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: 
http://Solr4-1-2.domain.com:8983/solr/collection2_0/:org.apache.solr.common.SolrException:
 Server at http://Solr4-1-3.domain.com:8983/solr/collection2_0 returned non ok 
status:503, message:Service Unavailable
        at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
        at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at 
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335)
        at 
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

================ 15 MINUTES AFTER EOF  ================

Jul 8, 2013 6:52:56 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: 
http://Solr4-1-2.domain.com:8983/solr/collection2_0/:org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at: 
http://Solr4-1-2.domain.com:8983/solr/collection2_0
        at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
        at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at 
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335)
        at 
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketException: Connection timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149)
        at 
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:111)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:264)
        at 
org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(LoggingSessionInputBuffer.java:115)
        at 
org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:98)
        at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252)
        at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282)
        at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247)
        at 
org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216)
        at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298)
        at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
        at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647)
        at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464)
        at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
        at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
        at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
        at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
        ... 11 more

Jul 8, 2013 6:52:56 PM 
org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
INFO: try and ask http://Solr4-1-2.domain.com:8983/solr to recover

Jul 8, 2013 6:52:56 PM org.apache.solr.client.solrj.impl.HttpClientUtil 
createClient
INFO: Creating new http client, 
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false

The machines are certainly not under any load.  During the outage disk, 
network, and cpu levels were dead almost as if nothing was happening.  
Initially we thought it could be related to GC since we are using the CMS stuff 
and we have a lot of allocated memory, but the outages are not consistent with 
the GC times that were logged.  After doing some digging we found that jetty is 
configured with a default maxIdleTime of 50000 ms.  We thought that this could 
explain the EOF exception in this way:
Solr4-1-2 accepts an update from the leader, takes to long to do whatever and 
jetty closes the socket. Then when it goes to finish writing it's response to 
the socket object it throws an EOF exception.  However I am not able to 
reproduce this!  I set the maxIdleTime to 100ms on both machines, run it under 
the same load but never get the error!

Does anyone have any suggestions/expertise that could explain this, or what we 
can try to further gather information/cause?  This has been going on for 3 
weeks now and I'M really starting to get mad about it.  I'M happy to send logs, 
and configurations privately to anyone who is willing to help us chase this 
down.

Thanks!

Jed

Reply via email to