[jira] [Commented] (SOLR-11988) FullSolrCloudDistribCmdsTest failures due to SolrCore initializating incorrectly thinking index directory already exists?

2018-02-15 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366437#comment-16366437
 ] 

Hoss Man commented on SOLR-11988:
-

Ok, the underlying problem seems to be:
 * SolrCore's usage of DirectoryFactory assumes particular semantics/meaning of 
how {{DirectoryFactory.exists(String)}} will behave
 ** Notably: that it returns {{true}} only if the directory exists "on disk" 
*AND* contains at least one file
 * EphemeralDirectoryFactory (and by extension MockDirectoryFactory) does not 
implement {{exists(String)}} in a manner consistent with that expectation:
 ** EphemeralDirectoryFactory's impl returns {{true}} anytime the specified 
path name has an entry in the cache – even if no files have been created in 
that directory

Notable if you have code like this
{code:java}
String path = ...; // some completley new never before seen directory name
DirectoryFactory dirFac = ...;
assertFalse(path + " should not exist yet", dirFac.exists(path));
Directory dir = dirFac.get(path, DirectoryFactory.DirContext.DEFAULT, 
DirectoryFactory.LOCK_TYPE_SINGLE);
assertFalse(path + " should still not exist", dirFac.exists(path));
{code}
...that second assert will fail for {{MockDirectoryFactory}}, but succeed for 
all other DirectoryFactory impls.

This problem can ultimately manifests itself with the previously mentioned 
{{IndexNotFoundException: no segments* file found in ...}} exception if the 
following situation arrises...
 # SolrCore starts initializing
 # A background thread (such as the monitors used for autoscaling) polls the 
metrics gauges for the SolrCore
 ** one of these gauges calls {{SolrCore.getIndexSize()}}
 ** This method calls {{directoryFactory.get(getIndexDir(),...)}} – which in 
the case of MockDirectoryFactory populates the "cache" for the 
{{getIndexDir()}} path
 # *After* some other thread has called {{SolrCore.getIndexSize()}}, then the 
core initialization thread makes it to {{SolrCore.initIndex(...)}}
 ** this calls {{getDirectoryFactory().exists(indexDir);}} to decide if we 
should attempt to initialize a new index
 ** Because of the buggy imple of {{MockDirectoryFactory.exists(...)}} this 
method assumes there's already an index, which is not true and eventually 
causes IndexWriter's constructor to complain about the lack of any segments.


I've attached a patch which:
 * Beefs up the javadocs for {{DirectoryFactory.exists}} to clarify the 
expected semantics
 * Fixes {{EphemeralDirectoryFactory.exists}} to meet those expectations
 * Adds a {{TestDirectoryFactory}} that asserts those semantics for all known 
types of impls of DirectoryFactory

I should point out there are several TODO comments in TestDirectoryFactory 
related to if we want to assert any specific semantics of how 
{{DirectoryFactory.exists}} should behave *before* a fyle is synced – because 
at the moment the behavior of impls like StandardDirectoryFactory disagree with 
the behavior of impls like NRTCachingDirectoryFactory that do caching/buffering 
in memory. Perhaps we should consider (in a new jira) tighing up those 
constraints, and "fixing" the caching based impls to return true if there are 
files on disk *OR* un-synced files in the cache?

> FullSolrCloudDistribCmdsTest failures due to SolrCore initializating 
> incorrectly thinking index directory already exists?
> -
>
> Key: SOLR-11988
> URL: https://issues.apache.org/jira/browse/SOLR-11988
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-11988.patch, SOLR-11988_nocommit_logging.patch, 
> log.txt
>
>
> There's been quite a few jenkins failures from FullSolrCloudDistribCmdsTest 
> that all seem to follow a similar pattern:
>  * Failure manifests as "Could not find collection:collection2"
>  * Failing seeds _frequently_ reproduce, but aren't guaranteed to
>  * Root cause can be traced back to the collection creation failing because 
> one of more replica cores failed due to the brand new (Solr)IndexWriter 
> expects to find an existing segments file
>  ** SolrCore should have already created an (empty) index in 
> {{SolrCore.initIndex(...)}}
>  ** The fact that the {{SolrIndexWrite}} throws this exception in it's 
> constructor suggests that the earlier call to {{SolrCore.initIndex(...)}} is 
> not functioning reliably
>  ** Based on some experimenting i've done, it seems like the underlying 
> problem is that in {{SolrCore.initIndex(...)}} the DirectoryFactory can "lie" 
> about wether a directory already exists.
> More details to follow in comments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Commented] (SOLR-11988) FullSolrCloudDistribCmdsTest failures due to SolrCore initializating incorrectly thinking index directory already exists?

2018-02-14 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364983#comment-16364983
 ] 

Hoss Man commented on SOLR-11988:
-

Note: with {{useFactory(null);}} added to 
{{FullSolrCloudDistribCmdsTest.beforeSuperClass()}} I can't get this seed to 
fail at all ... suggesting that MockDirectoryFactory is *almost certainly* the 
problem.



> FullSolrCloudDistribCmdsTest failures due to SolrCore initializating 
> incorrectly thinking index directory already exists?
> -
>
> Key: SOLR-11988
> URL: https://issues.apache.org/jira/browse/SOLR-11988
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-11988_nocommit_logging.patch, log.txt
>
>
> There's been quite a few jenkins failures from FullSolrCloudDistribCmdsTest 
> that all seem to follow a similar pattern:
>  * Failure manifests as "Could not find collection:collection2"
>  * Failing seeds _frequently_ reproduce, but aren't guaranteed to
>  * Root cause can be traced back to the collection creation failing because 
> one of more replica cores failed due to the brand new (Solr)IndexWriter 
> expects to find an existing segments file
>  ** SolrCore should have already created an (empty) index in 
> {{SolrCore.initIndex(...)}}
>  ** The fact that the {{SolrIndexWrite}} throws this exception in it's 
> constructor suggests that the earlier call to {{SolrCore.initIndex(...)}} is 
> not functioning reliably
>  ** Based on some experimenting i've done, it seems like the underlying 
> problem is that in {{SolrCore.initIndex(...)}} the DirectoryFactory can "lie" 
> about wether a directory already exists.
> More details to follow in comments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11988) FullSolrCloudDistribCmdsTest failures due to SolrCore initializating incorrectly thinking index directory already exists?

2018-02-14 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364894#comment-16364894
 ] 

Hoss Man commented on SOLR-11988:
-


Below is an example of the type of execption that gets logged when this test 
fails, from the log.txt file I just attached (where 2 _different_ replicas had 
this same problem).

Note that this log.txt file was produced using the attached 
SOLR-11988_nocommit_logging.patch that increases the logging verbosity in 
{{SolrCore.initIndex(...)}} -- hence the "nocommit" log line, and the line 
numbers not matching up exactly with current master -- but the net result is 
the same: In {{SolrCore.initIndex(...)}} the DirectoryFactory claims that the 
index directory for this brand new, never before in existence SolrCore, already 
exists and doesn't need to be initialized.  This then causes a problem when we 
try to open the "real" IndexWriter against it (using OpenMode.APPEND because we 
expect it to already exist)...

{noformat}
$ ant test  -Dtestcase=FullSolrCloudDistribCmdsTest -Dtests.method=test 
-Dtests.seed=E6FD3BCDEA5D2094 -Dtests.slow=true -Dtests.locale=ar-JO 
-Dtests.timezone=Asia/Aqtobe -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
...
   [junit4]   2> 33926 INFO  (qtp1926432793-173) [n:127.0.0.1:60391_kg_fmt 
c:collection2 s:shard6  x:collection2_shard6_replica_n32] o.a.s.c.SolrCore 
[collection2_shard6_replica_n32] nocommit: skipping creation of 
'/home/hossman/lucene/dev/solr/build/solr-core/test/J0/../../../../../../../../../home/hossman/lucene/dev/solr/build/solr-core/test/J0/temp/solr.cloud.FullSolrCloudDistribCmdsTest_E6FD3BCDEA5D2094-001/shard-4-001/cores/collection2_shard6_replica_n32/data/index/'
 (aka: 
'/home/hossman/lucene/dev/solr/build/solr-core/test/J0/../../../../../../../../../home/hossman/lucene/dev/solr/build/solr-core/test/J0/temp/solr.cloud.FullSolrCloudDistribCmdsTest_E6FD3BCDEA5D2094-001/shard-4-001/cores/collection2_shard6_replica_n32/data/index')
 because dirFac (org.apache.solr.core.MockDirectoryFactory@768117c) says it 
exists
...
   [junit4]   2> 34763 ERROR (qtp1926432793-173) [n:127.0.0.1:60391_kg_fmt 
c:collection2 s:shard6  x:collection2_shard6_replica_n32] 
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
CREATEing SolrCore 'collection2_shard6_replica_n32': Unable to create core 
[collection2_shard6_replica_n32] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(MockDirectoryWrapper(RAMDirectory@63c826a3 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@17cdc1)): files: 
[]
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:993)
   [junit4]   2>at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:91)
...
   [junit4]   2> Caused by: org.apache.solr.common.SolrException: Unable to 
create core [collection2_shard6_replica_n32]
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1059)
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:954)
   [junit4]   2>... 39 more
   [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening 
new searcher
   [junit4]   2>at 
org.apache.solr.core.SolrCore.(SolrCore.java:1013)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.(SolrCore.java:868)
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1043)
   [junit4]   2>... 40 more
   [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening 
new searcher
   [junit4]   2>at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2100)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2220)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1096)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.(SolrCore.java:985)
   [junit4]   2>... 42 more
   [junit4]   2> Caused by: org.apache.lucene.index.IndexNotFoundException: no 
segments* file found in 
LockValidatingDirectoryWrapper(MockDirectoryWrapper(RAMDirectory@63c826a3 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@17cdc1)): files: 
[]
   [junit4]   2>at 
org.apache.lucene.index.IndexWriter.(IndexWriter.java:1072)
   [junit4]   2>at 
org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:119)
   [junit4]   2>at 
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:94)
   [junit4]   2>at 
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:257)
   [junit4]   2>at 
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:131)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2061)