[ https://issues.apache.org/jira/browse/SOLR-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097138#comment-17097138 ]
David Smiley commented on SOLR-14452:
-------------------------------------

I think your master branch is out of date. On March 17th, in SOLR-14256, I fixed this bug, which had been around for a month.

> "classloading deadlock" issue with DocSet/SortedIntDocSet
> ---------------------------------------------------------
>
>                 Key: SOLR-14452
>                 URL: https://issues.apache.org/jira/browse/SOLR-14452
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> While beasting some facet related cloud tests on master, I noticed a pattern
> of occasional failures that seemed to crop up...
> * test ultimately fails due to a time out (usually the client threads time
> out waiting for a server response)
> * if I notice my CPU isn't spinning very hard _before_ the test fails, I can
> capture a jstack and inspect some threads
> * there will be multiple jetty/solr request threads (ex:
> {{"qtp82184175-145"}}) whose stack traces show various stages of DocSet
> collection that show they are {{"... in Object.wait()"}} but also {{RUNNABLE}}
> ...this isn't a thread summary+state combination that I'm used to seeing when
> looking at thread dumps, and some research into when/why this might happen
> led me to:
> * [https://stackoverflow.com/questions/28631656/runnable-thread-state-but-in-object-wait]
> ** [https://stackoverflow.com/a/28776438/689372]
> ***
> **** [http://ternarysearch.blogspot.com/2013/07/static-initialization-deadlock.html]
> **** [https://bugs.openjdk.java.net/browse/JDK-8037567]
> ...while the comments/status of JDK-8037567 suggest "nothing wrong here", the
> overall symptoms/description of the problem in the SO answer and linked blog,
> and the summation that this is essentially a "deadlock" situation in the class
> loader, do seem to correlate to some of the specifics I can see in the stack
> traces when this happens while running solr tests...
> * at least one "RUNNABLE / Object.wait" thread trying to do class init;
> class: DocSet...
> {noformat}
> "qtp1535326437-68" #68 prio=5 os_prio=0 cpu=72.48ms elapsed=241.69s tid=0x00007fc08c0a4000 nid=0x864 in Object.wait()  [0x00007fc0adedd000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.solr.search.DocSet.<clinit>(DocSet.java:118)
> 	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90)  // "new BitDocSet(..)"
> 	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
> {noformat}
> * other "RUNNABLE / Object.wait" threads are on lines that involve
> instantiating a subclass of DocSet:
> **
> {noformat}
> "qtp1535326437-67" #67 prio=5 os_prio=0 cpu=801.44ms elapsed=241.69s tid=0x00007fc08c0a1800 nid=0x863 in Object.wait()  [0x00007fc0adfdf000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90)  // "new BitDocSet(..)"
> 	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
> {noformat}
> **
> {noformat}
> "qtp82184175-65" #65 prio=5 os_prio=0 cpu=137.76ms elapsed=241.69s tid=0x00007fc088092000 nid=0x860 in Object.wait()  [0x00007fc0ae2e2000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:84)  // "new SortedIntDocSet(..)"
> 	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
> {noformat}
> ** etc...
> * DocSet has a static reference to a concrete subclass...
> ** {{public static final DocSet EMPTY = new SortedIntDocSet(new int[0], 0);}}
> ----
> I should point out:
> * While this particular "class loading deadlock" issue seems more likely to
> happen in a "test" situation where the JVMs/classloaders are short lived,
> there's no reason to assume this type of failure couldn't happen in a
> production solr instance when handling a burst of queries right after startup.
> * This type of failure (either specifically due to "DocSet vs
> SortedIntDocSet", or due to similar patterns in other classes) may also be
> the root cause of various other hard to reproduce "timed out" test failures
> we've seen over the years.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
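
For readers unfamiliar with the pattern described in the quoted report, here is a minimal sketch of how a superclass that instantiates one of its own subclasses in a static field can deadlock during class initialization. This is not Solr code: {{Base}} and {{Sub}} are hypothetical stand-ins for DocSet and SortedIntDocSet, and the latches and the sleep are assumptions added only to force the unlucky thread interleaving that would otherwise happen rarely.

```java
import java.util.concurrent.CountDownLatch;

class ClassInitDeadlockDemo {
    // Hypothetical helpers: used only to force the problematic interleaving.
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);
    static final CountDownLatch subInitAboutToStart = new CountDownLatch(1);

    // Stand-in for DocSet: its static initializer instantiates a subclass.
    static class Base {
        static final Base EMPTY;
        static {
            baseInitStarted.countDown();      // this thread is now initializing Base
            try {
                subInitAboutToStart.await();  // wait until the other thread is about to touch Sub
                Thread.sleep(300);            // give it time to begin initializing Sub
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            EMPTY = new Sub();                // requires Sub to be initialized
        }
    }

    // Stand-in for SortedIntDocSet: initializing it first requires Base.
    static class Sub extends Base { }

    /** Returns true if both threads are still stuck after the timeouts. */
    static boolean runDemo() {
        Thread a = new Thread(() -> {
            Object unused = Base.EMPTY;       // triggers initialization of Base
        }, "init-base");
        Thread b = new Thread(() -> {
            try {
                baseInitStarted.await();      // make sure Base init is already in progress
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            subInitAboutToStart.countDown();
            new Sub();                        // starts Sub init, then blocks waiting for Base
        }, "init-sub");
        // Daemon threads, so the JVM can still exit while they are stuck.
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(1500);
            b.join(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(a.getName() + " state=" + a.getState());
        System.out.println(b.getName() + " state=" + b.getState());
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + runDemo());
    }
}
```

The "init-base" thread is in the middle of initializing Base and needs Sub, while the "init-sub" thread has begun initializing Sub and must wait for its superclass Base. Neither can proceed, which matches the shape of the stack traces in the report: one thread inside a {{<clinit>}} frame and others parked on a line that instantiates a subclass.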