This sounds a lot like: https://issues.apache.org/jira/browse/SOLR-9706
Could you attach your patch to that issue if you think it's the same? And
please copy/paste your e-mail into a comment if you would; you've obviously
done more research on the cause than I did, and that'd save some work
whenever someone picks it up. It's unclear to me whether this is intentional
behavior or an accident of the code; either way, having a place to start
when analyzing it is much appreciated.

Best,
Erick

On Tue, Nov 22, 2016 at 10:02 AM, Jeremy Hoy <j...@findmypast.com> wrote:
> Hi All,
>
> We're running a fairly non-standard Solr configuration. We ingest into
> named shards in master cores and then replicate out to slaves running
> SolrCloud. So in effect we are using SolrCloud only to manage the config
> files and, more importantly, to look after the cluster state. Our corpus
> and search workload are such that this makes sense: it reduces the need
> to query every shard for each search, since the majority of queries
> contain values that allow us to target the shards holding the appropriate
> documents, and it also isolates the searching slaves from the costs of
> indexing (we index fairly infrequently, but in fairly large volumes). I'm
> happy to expand on this if anyone is interested, or to take suggestions
> as to how we might better be doing things.
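>
> For illustration, the shard targeting amounts to something like the
> following SolrJ sketch (the host, core and field names here are made up,
> and this is a simplification of what we actually run):
>
>     import java.io.IOException;
>
>     import org.apache.solr.client.solrj.SolrQuery;
>     import org.apache.solr.client.solrj.SolrServerException;
>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>     import org.apache.solr.client.solrj.response.QueryResponse;
>
>     public class TargetedQuery {
>         public static void main(String[] args)
>                 throws SolrServerException, IOException {
>             // Send the query to one slave; the shards parameter controls
>             // the fan-out of the distributed search.
>             HttpSolrClient client = new HttpSolrClient.Builder(
>                 "http://slave1:8983/solr/records").build();
>             SolrQuery q = new SolrQuery("surname:smith");
>             // Only the shards known to hold matching documents are
>             // searched, instead of every shard in the cluster.
>             q.set("shards",
>                 "slave1:8983/solr/records_a,slave2:8983/solr/records_b");
>             QueryResponse rsp = client.query(q);
>             System.out.println("hits: " + rsp.getResults().getNumFound());
>             client.close();
>         }
>     }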
>
> We've been running 4.6.0 for the past 3 years or so, but have recently
> upgraded to 5.5.2, and we'll likely be upgrading to 6.3.0 shortly.
> However, we hit a problem when running 5.5.2, which we also reproduced on
> 6.2.1 and 6.3.0. When a partial replication starts, it usually blocks all
> subsequent requests to Solr, whilst replication continues in the
> background. Whilst in this blocked state we took thread dumps using
> VisualVM; we see this when running 6.3.0:
>
> "explicit-fetchindex-cmd" - Thread t@71
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         ......
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
>         at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
>         at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
>         at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
>         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
>         at org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
>         at org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:745)
>
>    Locked ownable synchronizers:
>         - locked <4c18799d> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         - locked <64a00f> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>
> and
>
> "qtp1873653341-61" - Thread t@61
>    java.lang.Thread.State: TIMED_WAITING
>         at sun.misc.Unsafe.park(Native Method)
>         - waiting to lock <64a00f> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by "explicit-fetchindex-cmd" t@71
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
>         at org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:159)
>         at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:104)
>         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1781)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1577)
>         .....
>
> The cause of the problem seems to be that in IndexFetcher.fetchLatestIndex,
> when running in SolrCloud mode, the searcher is shut down prior to cleaning
> up the existing segment files and downloading the new ones.
>
> 6.3.0, lines 407-409:
>
>     if (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
>       solrCore.closeSearcher();
>     }
>
> Subsequently, solrCore.getUpdateHandler().newIndexWriter(true) takes a
> write lock on the IndexWriter, which is not released until the
> openIndexWriter call after the new files have been copied. Because
> openNewSearcher needs to take a read lock on the index writer, and it
> can't take that whilst the write lock is in place, all subsequent requests
> are blocked (a standalone sketch of this lock interaction is included
> below, after the description of our patch).
>
> To test this we queued up a load of search requests, then manually
> triggered replication, reasoning that a new searcher might be created
> before the write lock is taken. On a test instance, manually triggering
> replication would almost always result in all subsequent requests being
> blocked, but when we queued up search requests and ran these whilst
> triggering replication, this never resulted in the blocking behaviour we
> were seeing.
>
> We then patched Solr locally to comment out the closeSearcher call, on the
> basis that although we are running SolrCloud, if the core is also running
> as a slave there is no need to close the searcher.
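>
> Since the moving parts here are just a ReentrantReadWriteLock, the
> blocking is easy to reproduce outside Solr. The following is a minimal
> standalone sketch (plain java.util.concurrent, no Solr code) of the
> interaction described above: one thread holds the write lock for the
> duration of a long operation, and a second thread's timed tryLock on the
> read lock, which is roughly what DefaultSolrCoreState.lock does, cannot
> succeed until the write lock is released:
>
>     import java.util.concurrent.TimeUnit;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>
>     public class WriteLockBlocksReaders {
>         public static void main(String[] args) throws InterruptedException {
>             ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
>
>             // Stands in for the fetchindex thread: takes the write lock
>             // and holds it for the duration of the "download".
>             Thread fetcher = new Thread(() -> {
>                 rwl.writeLock().lock();
>                 try {
>                     Thread.sleep(10_000); // simulate a long file download
>                 } catch (InterruptedException ignored) {
>                 } finally {
>                     rwl.writeLock().unlock();
>                 }
>             }, "explicit-fetchindex-cmd");
>             fetcher.start();
>             Thread.sleep(100); // let the fetcher grab the write lock first
>
>             // Stands in for a query thread opening a new searcher: a timed
>             // tryLock on the read lock parks until the write lock is freed.
>             boolean acquired = rwl.readLock().tryLock(2, TimeUnit.SECONDS);
>             System.out.println("searcher thread got read lock: " + acquired);
>             // Prints false: the read lock cannot be acquired while the
>             // write lock is held, so the request is effectively blocked.
>             if (acquired) {
>                 rwl.readLock().unlock();
>             }
>         }
>     }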
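>
> Our local patch is nothing more than removing (or guarding) the
> closeSearcher call shown earlier. As a sketch (the legacySlave flag here
> is hypothetical, just to show the intent; our actual change simply
> comments the call out):
>
>     // IndexFetcher.fetchLatestIndex, around lines 407-409 in 6.3.0.
>     // Skip closing the searcher when this core, although ZooKeeper-aware,
>     // is also acting as a legacy replication slave: the old searcher can
>     // keep serving queries while the new index files are fetched.
>     boolean legacySlave = true; // hypothetical: derive from replication config
>     if (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()
>         && !legacySlave) {
>       solrCore.closeSearcher();
>     }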
>
> With that change in place this seems to work fine; replication works and
> nothing hangs.
>
> This seems like a bug to me, but we could find no other reports of the
> problem.
>
> So my questions are: Is it worth raising an issue in JIRA and working up
> a proper patch? Or is our setup so unique that there is little value in
> this? Or am I missing something else?
>
> Thanks,
>
> Jeremy