[jira] [Commented] (SOLR-11988) FullSolrCloudDistribCmdsTest failures due to SolrCore initializating incorrectly thinking index directory already exists?
[ https://issues.apache.org/jira/browse/SOLR-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366437#comment-16366437 ] Hoss Man commented on SOLR-11988: - Ok, the underlying problem seems to be: * SolrCore's usage of DirectoryFactory assumes particular semantics/meaning of how {{DirectoryFactory.exists(String)}} will behave ** Notably: that it returns {{true}} only if the directory exists "on disk" *AND* contains at least one file * EphemeralDirectoryFactory (and by extension MockDirectoryFactory) does not implement {{exists(String)}} in a manner consistent with that expectation: ** EphemeralDirectoryFactory's impl returns {{true}} anytime the specified path name has an entry in the cache – even if no files have been created in that directory Notable if you have code like this {code:java} String path = ...; // some completley new never before seen directory name DirectoryFactory dirFac = ...; assertFalse(path + " should not exist yet", dirFac.exists(path)); Directory dir = dirFac.get(path, DirectoryFactory.DirContext.DEFAULT, DirectoryFactory.LOCK_TYPE_SINGLE); assertFalse(path + " should still not exist", dirFac.exists(path)); {code} ...that second assert will fail for {{MockDirectoryFactory}}, but succeed for all other DirectoryFactory impls. This problem can ultimately manifests itself with the previously mentioned {{IndexNotFoundException: no segments* file found in ...}} exception if the following situation arrises... # SolrCore starts initializing # A background thread (such as the monitors used for autoscaling) polls the metrics gauges for the SolrCore ** one of these gauges calls {{SolrCore.getIndexSize()}} ** This method calls {{directoryFactory.get(getIndexDir(),...)}} – which in the case of MockDirectoryFactory populates the "cache" for the {{getIndexDir()}} path # *After* some other thread has called {{SolrCore.getIndexSize()}}, then the core initialization thread makes it to {{SolrCore.initIndex(...)}} ** this calls {{getDirectoryFactory().exists(indexDir);}} to decide if we should attempt to initialize a new index ** Because of the buggy imple of {{MockDirectoryFactory.exists(...)}} this method assumes there's already an index, which is not true and eventually causes IndexWriter's constructor to complain about the lack of any segments. I've attached a patch which: * Beefs up the javadocs for {{DirectoryFactory.exists}} to clarify the expected semantics * Fixes {{EphemeralDirectoryFactory.exists}} to meet those expectations * Adds a {{TestDirectoryFactory}} that asserts those semantics for all known types of impls of DirectoryFactory I should point out there are several TODO comments in TestDirectoryFactory related to if we want to assert any specific semantics of how {{DirectoryFactory.exists}} should behave *before* a fyle is synced – because at the moment the behavior of impls like StandardDirectoryFactory disagree with the behavior of impls like NRTCachingDirectoryFactory that do caching/buffering in memory. Perhaps we should consider (in a new jira) tighing up those constraints, and "fixing" the caching based impls to return true if there are files on disk *OR* un-synced files in the cache? > FullSolrCloudDistribCmdsTest failures due to SolrCore initializating > incorrectly thinking index directory already exists? > - > > Key: SOLR-11988 > URL: https://issues.apache.org/jira/browse/SOLR-11988 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: SOLR-11988.patch, SOLR-11988_nocommit_logging.patch, > log.txt > > > There's been quite a few jenkins failures from FullSolrCloudDistribCmdsTest > that all seem to follow a similar pattern: > * Failure manifests as "Could not find collection:collection2" > * Failing seeds _frequently_ reproduce, but aren't guaranteed to > * Root cause can be traced back to the collection creation failing because > one of more replica cores failed due to the brand new (Solr)IndexWriter > expects to find an existing segments file > ** SolrCore should have already created an (empty) index in > {{SolrCore.initIndex(...)}} > ** The fact that the {{SolrIndexWrite}} throws this exception in it's > constructor suggests that the earlier call to {{SolrCore.initIndex(...)}} is > not functioning reliably > ** Based on some experimenting i've done, it seems like the underlying > problem is that in {{SolrCore.initIndex(...)}} the DirectoryFactory can "lie" > about wether a directory already exists. > More details to follow in comments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) -
[jira] [Commented] (SOLR-11988) FullSolrCloudDistribCmdsTest failures due to SolrCore initializating incorrectly thinking index directory already exists?
[ https://issues.apache.org/jira/browse/SOLR-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364983#comment-16364983 ] Hoss Man commented on SOLR-11988: - Note: with {{useFactory(null);}} added to {{FullSolrCloudDistribCmdsTest.beforeSuperClass()}} I can't get this seed to fail at all ... suggesting that MockDirectoryFactory is *almost certainly* the problem. > FullSolrCloudDistribCmdsTest failures due to SolrCore initializating > incorrectly thinking index directory already exists? > - > > Key: SOLR-11988 > URL: https://issues.apache.org/jira/browse/SOLR-11988 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: SOLR-11988_nocommit_logging.patch, log.txt > > > There's been quite a few jenkins failures from FullSolrCloudDistribCmdsTest > that all seem to follow a similar pattern: > * Failure manifests as "Could not find collection:collection2" > * Failing seeds _frequently_ reproduce, but aren't guaranteed to > * Root cause can be traced back to the collection creation failing because > one of more replica cores failed due to the brand new (Solr)IndexWriter > expects to find an existing segments file > ** SolrCore should have already created an (empty) index in > {{SolrCore.initIndex(...)}} > ** The fact that the {{SolrIndexWrite}} throws this exception in it's > constructor suggests that the earlier call to {{SolrCore.initIndex(...)}} is > not functioning reliably > ** Based on some experimenting i've done, it seems like the underlying > problem is that in {{SolrCore.initIndex(...)}} the DirectoryFactory can "lie" > about wether a directory already exists. > More details to follow in comments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11988) FullSolrCloudDistribCmdsTest failures due to SolrCore initializating incorrectly thinking index directory already exists?
[ https://issues.apache.org/jira/browse/SOLR-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364894#comment-16364894 ] Hoss Man commented on SOLR-11988: - Below is an example of the type of execption that gets logged when this test fails, from the log.txt file I just attached (where 2 _different_ replicas had this same problem). Note that this log.txt file was produced using the attached SOLR-11988_nocommit_logging.patch that increases the logging verbosity in {{SolrCore.initIndex(...)}} -- hence the "nocommit" log line, and the line numbers not matching up exactly with current master -- but the net result is the same: In {{SolrCore.initIndex(...)}} the DirectoryFactory claims that the index directory for this brand new, never before in existence SolrCore, already exists and doesn't need to be initialized. This then causes a problem when we try to open the "real" IndexWriter against it (using OpenMode.APPEND because we expect it to already exist)... {noformat} $ ant test -Dtestcase=FullSolrCloudDistribCmdsTest -Dtests.method=test -Dtests.seed=E6FD3BCDEA5D2094 -Dtests.slow=true -Dtests.locale=ar-JO -Dtests.timezone=Asia/Aqtobe -Dtests.asserts=true -Dtests.file.encoding=US-ASCII ... [junit4] 2> 33926 INFO (qtp1926432793-173) [n:127.0.0.1:60391_kg_fmt c:collection2 s:shard6 x:collection2_shard6_replica_n32] o.a.s.c.SolrCore [collection2_shard6_replica_n32] nocommit: skipping creation of '/home/hossman/lucene/dev/solr/build/solr-core/test/J0/../../../../../../../../../home/hossman/lucene/dev/solr/build/solr-core/test/J0/temp/solr.cloud.FullSolrCloudDistribCmdsTest_E6FD3BCDEA5D2094-001/shard-4-001/cores/collection2_shard6_replica_n32/data/index/' (aka: '/home/hossman/lucene/dev/solr/build/solr-core/test/J0/../../../../../../../../../home/hossman/lucene/dev/solr/build/solr-core/test/J0/temp/solr.cloud.FullSolrCloudDistribCmdsTest_E6FD3BCDEA5D2094-001/shard-4-001/cores/collection2_shard6_replica_n32/data/index') because dirFac (org.apache.solr.core.MockDirectoryFactory@768117c) says it exists ... [junit4] 2> 34763 ERROR (qtp1926432793-173) [n:127.0.0.1:60391_kg_fmt c:collection2 s:shard6 x:collection2_shard6_replica_n32] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'collection2_shard6_replica_n32': Unable to create core [collection2_shard6_replica_n32] Caused by: no segments* file found in LockValidatingDirectoryWrapper(MockDirectoryWrapper(RAMDirectory@63c826a3 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@17cdc1)): files: [] [junit4] 2>at org.apache.solr.core.CoreContainer.create(CoreContainer.java:993) [junit4] 2>at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:91) ... [junit4] 2> Caused by: org.apache.solr.common.SolrException: Unable to create core [collection2_shard6_replica_n32] [junit4] 2>at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1059) [junit4] 2>at org.apache.solr.core.CoreContainer.create(CoreContainer.java:954) [junit4] 2>... 39 more [junit4] 2> Caused by: org.apache.solr.common.SolrException: Error opening new searcher [junit4] 2>at org.apache.solr.core.SolrCore.(SolrCore.java:1013) [junit4] 2>at org.apache.solr.core.SolrCore.(SolrCore.java:868) [junit4] 2>at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1043) [junit4] 2>... 40 more [junit4] 2> Caused by: org.apache.solr.common.SolrException: Error opening new searcher [junit4] 2>at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2100) [junit4] 2>at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2220) [junit4] 2>at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1096) [junit4] 2>at org.apache.solr.core.SolrCore.(SolrCore.java:985) [junit4] 2>... 42 more [junit4] 2> Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MockDirectoryWrapper(RAMDirectory@63c826a3 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@17cdc1)): files: [] [junit4] 2>at org.apache.lucene.index.IndexWriter.(IndexWriter.java:1072) [junit4] 2>at org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:119) [junit4] 2>at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:94) [junit4] 2>at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:257) [junit4] 2>at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:131) [junit4] 2>at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2061)