[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher

2017-02-13 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-10006:
--
Attachment: solr.log

Still fails. See the attached log for everything after I restarted the Solr 
node on which I had removed some index files from one of the cores. This is on 
a fresh 6x pull from the last hour.

> Cannot do a full sync (fetchindex) if the replica can't open a searcher
> ---
>
> Key: SOLR-10006
> URL: https://issues.apache.org/jira/browse/SOLR-10006
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.3.1, 6.4
>Reporter: Erick Erickson
> Attachments: SOLR-10006.patch, SOLR-10006.patch, solr.log, solr.log
>
>
> Doing a full sync or fetchindex requires an open searcher and if you can't 
> open the searcher those operations fail.
> For discussion. I've seen a situation in the field where a replica's index 
> became corrupt. When the node was restarted, the replica tried to do a full 
> sync but failed because the core couldn't open a searcher. The replica went 
> into an endless sync/fail/sync cycle.
> I couldn't reproduce that exact scenario, but it's easy enough to get into a 
> similar situation. Create a 2x2 collection and index some docs. Then stop one 
> of the instances, remove a couple of segment files, and restart.
> The replica stays in the "down" state, fine so far.
> Manually issue a fetchindex. That fails because the replica can't open a 
> searcher. Sure, issuing a fetchindex is abusive but I think it's the same 
> underlying issue: why should we care about the state of a replica's current 
> index when we're going to completely replace it anyway?
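
For readers who want to repeat the manual fetchindex step described above, here is a minimal sketch of how the request URL could be built. It assumes the replica exposes Solr's replication handler at the usual `/replication` path; the host, port, and core name are hypothetical placeholders, not values from this issue.

```java
import java.net.URI;

class FetchIndexUrlSketch {
    /**
     * Builds the URL for a manual fetchindex request against one core.
     * Host, port, and core name are supplied by the caller; the values used
     * in main() are hypothetical.
     */
    static URI fetchIndexUri(String host, int port, String core) {
        return URI.create("http://" + host + ":" + port
                + "/solr/" + core + "/replication?command=fetchindex");
    }

    public static void main(String[] args) {
        // Hypothetical replica coordinates, for illustration only.
        System.out.println(fetchIndexUri("localhost", 8983, "test_shard1_replica2"));
    }
}
```

The resulting URL can be passed to any HTTP client. In the scenario described above, the request is expected to fail while the core cannot open a searcher.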



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher

2017-01-25 Thread Mike Drob (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated SOLR-10006:
-
Attachment: SOLR-10006.patch

New patch that fixes your specific issue, although it probably still needs a 
little work.

First, we would probably want to catch EOFException and FileNotFoundException 
in addition to NoSuchFileException in IndexWriter.
Second, do we actually want to catch that at IndexWriter? There's a wide range 
of places where we could catch and rethrow, and one could reasonably make an 
argument for any of them:

{noformat}
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:192)
at org.apache.solr.core.MetricsDirectoryFactory$MetricsDirectory.openInput(MetricsDirectoryFactory.java:334)
at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader.<init>(Lucene50PostingsReader.java:81)
at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:442)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:109)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:195)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:103)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:473)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:103)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:79)
at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:39)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1958)
{noformat}

That might be better as a Lucene discussion though?
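
On the first point above, a small sketch of why an end-of-file failure is a separate case from a missing file: a file that exists but has been cut short fails with EOFException rather than NoSuchFileException. This uses plain JDK I/O on temporary files and does not claim that Lucene's own readers fail in exactly this way; the class and method names are hypothetical.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

class TruncatedFileSketch {
    /** Tries to read a 4-byte header and reports which failure, if any, occurred. */
    static String probe(Path file) {
        try (InputStream in = Files.newInputStream(file);
             DataInputStream data = new DataInputStream(in)) {
            data.readInt(); // needs four bytes; fewer means the file was cut short
            return "ok";
        } catch (EOFException e) {
            return "truncated";
        } catch (NoSuchFileException e) {
            return "missing";
        } catch (IOException e) {
            return "other";
        }
    }

    /** Writes the given bytes to a fresh temporary file and returns its path. */
    static Path tempFileWith(byte[] content) {
        try {
            Path file = Files.createTempFile("sketch", ".bin");
            return Files.write(file, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path intact = tempFileWith(new byte[] {0, 0, 0, 7});
        Path truncated = tempFileWith(new byte[] {1, 2});
        Path missing = truncated.resolveSibling("does-not-exist.bin");
        System.out.println(probe(intact));    // four bytes available
        System.out.println(probe(truncated)); // only two bytes available
        System.out.println(probe(missing));   // no file at all
    }
}
```

A handler that only catches the missing-file case would let the truncated case through untranslated, which is the gap the first point describes.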




[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher

2017-01-25 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-10006:
--
Attachment: solr.log

Mike:

First of all, thanks for looking. This is the full log file after starting, on 
a fresh trunk pull this AM.

Here's what I did to make this happen:
1> set up a 2x2 collection
2> indexed a bunch of docs. Stupid-simple indexing, I just wanted to get more 
than one segment. I'm not sure having more than one segment is actually 
relevant.
3> shut down a follower
4> removed a few of the segment files. Not an entire segment, just 3 files at 
random from a single segment. 
5> removed all the logs from the log directory.
6> tried to start the replica.




[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher

2017-01-20 Thread Mike Drob (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated SOLR-10006:
-
Attachment: SOLR-10006.patch

Patch that catches FileNotFoundException (FNFE) and NoSuchFileException (NSFE) 
and rethrows them as CorruptIndexException.

There might be an argument to be made for pushing the try/catch down to the 
various implementations of {{SegmentInfoFormat::read}}, but I don't think that 
will be maintainable going forward.

Another option is to catch all IOExceptions in {{SegmentInfos::readCommit}}, but 
that's a pretty wide net to cast and would mask any IndexTooOldExceptions 
unless we specifically exclude them.
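
To make the trade-off between the two options concrete, here is a minimal sketch that contrasts the narrow net (only FNFE and NSFE) with the wide net (all IOExceptions, with one type explicitly excluded). It is not the attached patch. `CorruptStandIn` and `TooOldStandIn` are hypothetical stand-ins so the sketch compiles without Lucene, and `Read` is a hypothetical placeholder for the step that reads a commit or segment file.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.NoSuchFileException;

class CatchScopeSketch {
    /** Hypothetical stand-in for Lucene's CorruptIndexException. */
    static class CorruptStandIn extends IOException {
        CorruptStandIn(String message, Throwable cause) { super(message, cause); }
    }

    /** Hypothetical stand-in for the "index too old" exception discussed above. */
    static class TooOldStandIn extends IOException {
        TooOldStandIn(String message) { super(message); }
    }

    /** Hypothetical placeholder for reading one commit or segment file. */
    interface Read {
        void run() throws IOException;
    }

    /** Narrow net: only the two missing-file exceptions become corruption. */
    static void narrow(Read read) throws IOException {
        try {
            read.run();
        } catch (FileNotFoundException | NoSuchFileException e) {
            throw new CorruptStandIn("missing index file", e);
        }
    }

    /** Wide net: every IOException becomes corruption, except the excluded type. */
    static void wide(Read read) throws IOException {
        try {
            read.run();
        } catch (TooOldStandIn e) {
            throw e; // excluded explicitly so that it is not masked
        } catch (IOException e) {
            throw new CorruptStandIn("unreadable index file", e);
        }
    }

    /** Returns the simple class name of whatever the chosen policy throws, or "none". */
    static String outcome(boolean useWide, Read read) {
        try {
            if (useWide) {
                wide(read);
            } else {
                narrow(read);
            }
            return "none";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        Read missing = () -> { throw new NoSuchFileException("_0.si"); };
        Read tooOld = () -> { throw new TooOldStandIn("format too old"); };
        Read other = () -> { throw new IOException("disk error"); };
        for (boolean useWide : new boolean[] {false, true}) {
            System.out.println((useWide ? "wide" : "narrow")
                    + ": missing=" + outcome(useWide, missing)
                    + ", tooOld=" + outcome(useWide, tooOld)
                    + ", other=" + outcome(useWide, other));
        }
    }
}
```

The two policies differ only for the generic IOException: the narrow net lets it propagate unchanged, while the wide net reports it as corruption. The too-old case survives the wide net only because of the explicit exclusion.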
