Hmmm, I'm puzzled then. I'm guessing that the node that keeps going down is the follower, which means it should have _less_ work to do than the node that stays up. Not a lot less, but less still.
I'd try lengthening my commit interval. I realize you've set it to 2 seconds for a reason; this is mostly to see whether it has any effect and to give you a place to _start_ looking. I'm assuming your hard commit has openSearcher set to false.

Just to double check, these two nodes are just a leader and follower, right? IOW, they're part of the same collection, and your collection has just one shard.

M/c configuration? What's that? If it's a typo for m/s (master/slave), then that may be an issue: in a SolrCloud setup there is no master/slave, and you shouldn't configure them....
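Concretely, something like the following in solrconfig.xml is what I have in mind. The numbers are only a starting point for experimenting, not a recommendation:

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Hard commit: durability only; don't open a new searcher -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <!-- Soft commit: controls visibility; try something well above 2 seconds -->
    <autoSoftCommit>
      <maxTime>60000</maxTime>
    </autoSoftCommit>
  </updateHandler>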
Best,
Erick

On Wed, Sep 3, 2014 at 8:52 PM, Ethan <eh198...@gmail.com> wrote:
> Erick,
>
> It is just one shard. Indexing traffic is going to the other node and then
> synched with this one (both are part of the cloud). We kept that setting
> running for 5 days, as the defective node would just go down with search
> traffic. So both were in sync when search was turned on. Soft commit is
> very low, around 2 secs, but that doesn't seem to affect the other node,
> which is functioning normally.
>
> Memory settings for both nodes are identical, including m/c configuration.
>
> On Wed, Sep 3, 2014 at 4:23 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>> Do you have indexing traffic going to it? B/c this _looks_ like the
>> node is just starting up, or a searcher is being opened and you're
>> loading your index for the first time. This happens when you index data
>> and when you start up your nodes. Adding some autowarming
>> (firstSearcher in this case) might load up the underlying caches
>> earlier. This could also be a problem due to very short commit
>> intervals, although that should be identical for both nodes.
>>
>> And when you say 2 Solr nodes, is this one shard or two?
>>
>> I'm guessing that you have some setting that's significantly different,
>> memory perhaps?
>>
>> Best,
>> Erick
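[A note on the firstSearcher autowarming mentioned above: it is configured as a listener in the <query> section of solrconfig.xml. A minimal sketch; the query and sort field below are hypothetical stand-ins for whatever the real search traffic sorts on:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="sort">some_long_field desc</str>
      </lst>
    </arr>
  </listener>
]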
>> On Wed, Sep 3, 2014 at 2:40 PM, Ethan <eh198...@gmail.com> wrote:
>> > Forgot to add the source thread that's blocking every other thread:
>> >
>> > "http-bio-52158-exec-61" - Thread t@591
>> >    java.lang.Thread.State: RUNNABLE
>> >    at org.apache.lucene.search.FieldCacheImpl$Uninvert.uninvert(FieldCacheImpl.java:312)
>> >    at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:986)
>> >    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:212)
>> >    - locked org.apache.lucene.search.FieldCache$CreationPlaceholder@29e0400b
>> >    at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:901)
>> >    at org.apache.lucene.search.FieldComparator$LongComparator.setNextReader(FieldComparator.java:685)
>> >    at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:97)
>> >    at org.apache.lucene.search.TimeLimitingCollector.setNextReader(TimeLimitingCollector.java:158)
>> >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:618)
>> >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
>> >    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1501)
>> >    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1367)
>> >    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
>> >    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:434)
>> >    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
>> >    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>> >    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>> >    at com.trimp.search.filter.LogAndAuthFilter.execute(LogAndAuthFilter.scala:109)
>> >    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>> >    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>> >    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>> >    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>> >    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>> >    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>> >    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>> >    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>> >    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
>> >    at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)
>> >    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>> >    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>> >    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1009)
>> >    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>> >    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
>> >    - locked org.apache.tomcat.util.net.SocketWrapper@7826692
>> >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> >    at java.lang.Thread.run(Thread.java:722)
>> >
>> >    Locked ownable synchronizers:
>> >    - locked java.util.concurrent.ThreadPoolExecutor$Worker@2463aef
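[A note for readers of the archive: the RUNNABLE thread above is Lucene's FieldCache "uninverting" a long field the first time a sort on it runs; every other sorting query blocks behind that lock until it finishes. On Solr/Lucene 4.x, declaring the sort field with docValues avoids FieldCache uninversion entirely, at the cost of a reindex. A minimal schema.xml sketch; the field name is hypothetical, and "tlong" assumes the stock Trie long fieldType:

  <field name="my_sort_field" type="tlong" indexed="true" stored="false"
         docValues="true"/>
]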
>> > On Wed, Sep 3, 2014 at 2:31 PM, Ethan <eh198...@gmail.com> wrote:
>> >> We have a SolrCloud instance with 2 Solr nodes and a 3-node ZooKeeper
>> >> ensemble. One of the Solr nodes goes down as soon as we send search
>> >> traffic to it, but update works fine.
>> >>
>> >> When I analyzed the thread dump I saw a lot of blocked threads like the
>> >> one below. This explains why the JVM eventually couldn't create any new
>> >> native threads and ran out of memory: the thread count went from 48 to
>> >> 900 within minutes and the server came down. The other node, with the
>> >> same configuration, is taking all the search and update traffic and is
>> >> running fine.
>> >>
>> >> Any pointers would be appreciated.
>> >>
>> >> "http-bio-52158-exec-59" - Thread t@589
>> >>    java.lang.Thread.State: BLOCKED on
>> >>    org.apache.lucene.search.FieldCache$CreationPlaceholder@29e0400b
>> >>    owned by: http-bio-52158-exec-61
>> >>    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:209)
>> >>    at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:901)
>> >>    at org.apache.lucene.search.FieldComparator$LongComparator.setNextReader(FieldComparator.java:685)
>> >>    at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:97)
>> >>    at org.apache.lucene.search.TimeLimitingCollector.setNextReader(TimeLimitingCollector.java:158)
>> >>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:618)
>> >>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
>> >>    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1501)
>> >>    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1367)
>> >>    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
>> >>    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:434)
>> >>    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
>> >>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>> >>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>> >>    at com.trimp.search.filter.LogAndAuthFilter.execute(LogAndAuthFilter.scala:109)
>> >>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>> >>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>> >>    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>> >>    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>> >>    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>> >>    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>> >>    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>> >>    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>> >>    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
>> >>    at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)
>> >>    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>> >>    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>> >>    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1009)
>> >>    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>> >>    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
>> >>    - locked org.apache.tomcat.util.net.SocketWrapper@5b4530c8
>> >>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> >>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> >>    at java.lang.Thread.run(Thread.java:722)
>> >>
>> >>    Locked ownable synchronizers:
>> >>    - locked java.util.concurrent.ThreadPoolExecutor$Worker@63d2720
>> >>
>> >> -E
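[One more aside: the thread explosion (48 to 900) is the Tomcat "bio" connector dedicating a new thread to each stalled request. Capping the container's pool won't cure the FieldCache stall, but it can keep the JVM from dying with "unable to create new native thread". A hypothetical server.xml connector matching the port in the thread names above; all values are illustrative only:

  <Connector port="52158" protocol="HTTP/1.1"
             maxThreads="200" acceptCount="100"
             connectionTimeout="20000"/>
]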