This sounds a lot like:

https://issues.apache.org/jira/browse/SOLR-9706


Could you attach your patch to that issue if you think it's the same?
And please copy/paste your e-mail into a comment if you would; you've
obviously done more research on the cause than I did, and that'd save
some work whenever someone picks it up.

It's unclear to me whether this is intentional behavior or an accident
of code; either way, having a place to start when analyzing it is much
appreciated.


Best,
Erick

On Tue, Nov 22, 2016 at 10:02 AM, Jeremy Hoy <j...@findmypast.com> wrote:
> Hi All,
>
> We're running a fairly non-standard Solr configuration.  We ingest into named 
> shards in master cores and then replicate out to slaves running SolrCloud.  
> So in effect we are using SolrCloud only to manage the config files and, more 
> importantly, to look after the cluster state.  Our corpus and search workload 
> are such that this makes sense: it reduces the need to query every shard for 
> each search, since the majority of queries contain values that allow us to 
> target the search towards the shards holding the appropriate documents, and it 
> also isolates the searching slaves from the costs of indexing (we index fairly 
> infrequently, but in fairly large volumes).  I'm happy to expand on this if 
> anyone is interested, or to take suggestions as to how we might better be 
> doing things.
>
> We've been running 4.6.0 for the past 3 years or so, but have recently 
> upgraded to 5.5.2 - we'll likely be upgrading to 6.3.0 shortly.  However, we 
> hit a problem when running 5.5.2, which we have also reproduced in 6.2.1 and 
> 6.3.0.  When a partial replication starts, it will usually block all 
> subsequent requests to Solr, whilst replication continues in the background.  
> Whilst in this blocked state we took thread dumps using VisualVM; this is 
> what we see when running 6.3.0:
>
> "explicit-fetchindex-cmd" - Thread t@71
>    java.lang.Thread.State: RUNNABLE
>                 at java.net.SocketInputStream.socketRead0(Native Method)
>                 ......
>                 at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
>                 at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
>                 at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
>                 at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
>                 at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
>                 at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
>                 at 
> org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
>                 at 
> org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown 
> Source)
>                 at java.lang.Thread.run(Thread.java:745)
>
>    Locked ownable synchronizers:
>                 - locked <4c18799d> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>
>                 - locked <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>
> and
>
> "qtp1873653341-61" - Thread t@61
>    java.lang.Thread.State: TIMED_WAITING
>                 at sun.misc.Unsafe.park(Native Method)
>                 - waiting to lock <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by 
> "explicit-fetchindex-cmd" t@71
>                 at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>                 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>                 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>                 at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
>                 at 
> org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:159)
>                 at 
> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:104)
>                 at 
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1781)
>                 at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>                 at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
>                 at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1577)
>                 .....
>
>
> The cause of the problem seems to be that in IndexFetcher.fetchLatestIndex, 
> when running as SolrCloud, the searcher is shut down prior to cleaning up 
> the existing segment files and downloading the new ones.
>
> 6.3.0, lines 407-409:
>
>     if (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
>       solrCore.closeSearcher();
>     }
>
> Subsequently, solrCore.getUpdateHandler().newIndexWriter(true) takes a write 
> lock on the IndexWriter, which is not released until the openIndexWriter call 
> after the new files have been copied.  Because openNewSearcher needs to take 
> a read lock on the index writer, and it cannot do so whilst the write lock is 
> held, all subsequent requests are blocked.
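>
> To make the lock interaction concrete, here is a minimal, hypothetical sketch 
> using a plain ReentrantReadWriteLock; it is not the actual DefaultSolrCoreState 
> code, just the shape of the problem:
>
>     import java.util.concurrent.TimeUnit;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>
>     public class FetchLockSketch {
>       private static final ReentrantReadWriteLock iwLock = new ReentrantReadWriteLock();
>
>       // Roughly what the fetch path does: take the write lock (newIndexWriter(true))
>       // and hold it while the new segment files are downloaded, releasing it only
>       // when the writer is reopened afterwards (openIndexWriter).
>       static void fetchLatestIndex() throws InterruptedException {
>         iwLock.writeLock().lock();
>         try {
>           Thread.sleep(60_000);   // stands in for downloadIndexFiles(...)
>         } finally {
>           iwLock.writeLock().unlock();
>         }
>       }
>
>       // Roughly what openNewSearcher ends up doing via getIndexWriter: it needs a
>       // read lock, and since the searcher was already closed, every incoming
>       // request parks here until the fetch releases the write lock.
>       static void openNewSearcher() throws InterruptedException {
>         if (iwLock.readLock().tryLock(10, TimeUnit.SECONDS)) {
>           try {
>             // ... build the new searcher ...
>           } finally {
>             iwLock.readLock().unlock();
>           }
>         } else {
>           System.out.println("request blocked behind the fetch");
>         }
>       }
>     }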
>
> To test this, we queued up a load of search requests and then manually 
> triggered replication, reasoning that a new searcher might be created before 
> the write lock is taken.  On a test instance, manually triggering replication 
> would almost always result in all subsequent requests being blocked, but when 
> we queued up search requests and ran these whilst triggering replication, it 
> never resulted in the blocking behaviour we were seeing.
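>
> Something along the following lines is enough to drive that kind of test (a 
> sketch only; the host, port and core name are placeholders, though /select 
> and /replication?command=fetchindex are the standard Solr endpoints):
>
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>
>     public class ReplicationBlockTest {
>       static final String BASE = "http://localhost:8983/solr/core1";  // placeholder core URL
>
>       static int get(String path) throws Exception {
>         HttpURLConnection conn = (HttpURLConnection) new URL(BASE + path).openConnection();
>         conn.setConnectTimeout(5_000);
>         conn.setReadTimeout(30_000);
>         return conn.getResponseCode();
>       }
>
>       public static void main(String[] args) throws Exception {
>         ExecutorService pool = Executors.newFixedThreadPool(20);
>         // Keep a steady stream of searches going, so a searcher is (re)opened
>         // before the fetch takes the write lock; runs until killed.
>         for (int i = 0; i < 20; i++) {
>           pool.execute(() -> {
>             while (true) {
>               try { get("/select?q=*:*"); } catch (Exception e) { /* ignore and retry */ }
>             }
>           });
>         }
>         // ...then kick off replication and watch whether the searches stall.
>         get("/replication?command=fetchindex");
>       }
>     }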
>
> We then patched Solr locally to comment out the closeSearcher call, on the 
> basis that, whilst we are running SolrCloud, if the core is also running as a 
> slave there is no need to close the searcher.  This seems to work fine: 
> replication works and nothing hangs.
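>
> For clarity, the local change amounts to guarding out that one call in 
> IndexFetcher.fetchLatestIndex (sketched against the 6.3.0 block quoted above):
>
>     if (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
>       // Commented out locally: the core is also acting as a replication slave,
>       // so the existing searcher can keep serving until the new index is in place.
>       // solrCore.closeSearcher();
>     }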
>
> This seems like a bug to me, but we could find no other reports of the 
> problem.
>
> So my questions are:  Is it worth raising an issue in JIRA and working up a 
> proper patch?  Or is our setup so unique that there is little value in this?  
> Or am I missing something else?
>
> Thanks,
>
> Jeremy
