[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher
[ https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-10006:
----------------------------------
    Attachment: solr.log

Still fails. The attached log covers everything after I restarted the Solr node on which I had removed some index files from one of the cores. This is on a fresh 6x pull from the last hour.

> Cannot do a full sync (fetchindex) if the replica can't open a searcher
> -----------------------------------------------------------------------
>
>                 Key: SOLR-10006
>                 URL: https://issues.apache.org/jira/browse/SOLR-10006
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 5.3.1, 6.4
>            Reporter: Erick Erickson
>         Attachments: SOLR-10006.patch, SOLR-10006.patch, solr.log, solr.log
>
>
> Doing a full sync or fetchindex requires an open searcher, and if you can't
> open the searcher those operations fail.
> For discussion. I've seen a situation in the field where a replica's index
> became corrupt. When the node was restarted, the replica tried to do a full
> sync but failed because the core couldn't open a searcher. The replica went
> into an endless sync/fail/sync cycle.
> I couldn't reproduce that exact scenario, but it's easy enough to get into a
> similar situation. Create a 2x2 collection and index some docs. Then stop one
> of the instances, go in and remove a couple of segment files, and restart.
> The replica stays in the "down" state, fine so far.
> Manually issue a fetchindex. That fails because the replica can't open a
> searcher. Sure, issuing a fetchindex is abusive, but I think it's the same
> underlying issue: why should we care about the state of a replica's current
> index when we're going to completely replace it anyway?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher
[ https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Drob updated SOLR-10006:
-----------------------------
    Attachment: SOLR-10006.patch

New patch that fixes your specific issue; however, it probably still needs a little work.

First, we would probably want to catch EOF and FileNotFound in addition to NoSuchFile in IndexWriter.

Second, do we actually want to catch that at IndexWriter? There's a wide range of places where we can catch and rethrow, and one could reasonably make an argument for any of them:

{noformat}
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:192)
at org.apache.solr.core.MetricsDirectoryFactory$MetricsDirectory.openInput(MetricsDirectoryFactory.java:334)
at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader.<init>(Lucene50PostingsReader.java:81)
at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:442)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:109)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:195)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:103)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:473)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:103)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:79)
at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:39)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1958)
{noformat}

That might be better as a Lucene discussion, though?
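The first point above (treating EOF and FileNotFound like NoSuchFile) can be illustrated with a minimal, Lucene-free sketch. It is not the attached patch: `MissingFileClassifier`, `CorruptIndexSketchException`, `IndexOpenStep`, and `runOrReportCorruption` are hypothetical names invented for illustration, and the exception class is only a stand-in for Lucene's `CorruptIndexException`. Where the catch belongs in the real call stack is exactly the open question raised in the comment.

```java
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.NoSuchFileException;

public class MissingFileClassifier {

    /** Hypothetical stand-in for Lucene's CorruptIndexException (assumption, not the real class). */
    static class CorruptIndexSketchException extends IOException {
        CorruptIndexSketchException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical hook for any step that opens index files and may fail with an IOException. */
    interface IndexOpenStep {
        void run() throws IOException;
    }

    /** True for the three exception types named in the comment: EOF, FileNotFound, NoSuchFile. */
    static boolean looksLikeMissingOrTruncatedFile(IOException e) {
        return e instanceof EOFException
                || e instanceof FileNotFoundException
                || e instanceof NoSuchFileException;
    }

    /** Runs the step; missing or truncated files are rethrown as a corruption error, others pass through. */
    static void runOrReportCorruption(String resource, IndexOpenStep step) throws IOException {
        try {
            step.run();
        } catch (IOException e) {
            if (looksLikeMissingOrTruncatedFile(e)) {
                throw new CorruptIndexSketchException("Problem reading index from " + resource, e);
            }
            throw e; // unrelated I/O failures keep their original type
        }
    }

    public static void main(String[] args) {
        try {
            // Simulate a segment file that was deleted from the core's index directory.
            runOrReportCorruption("_0.si", () -> {
                throw new NoSuchFileException("_0.si");
            });
        } catch (IOException e) {
            System.out.println(e.getClass().getSimpleName() + " caused by "
                    + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Keeping the classification in one predicate means the same check could be applied at whichever frame of the stack trace is eventually chosen.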
[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher
[ https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-10006:
----------------------------------
    Attachment: solr.log

Mike:

First of all, thanks for looking. This is the full log file after starting, fresh trunk pull this AM. Here's what I did to make this happen:

1> set up a 2x2 collection
2> indexed a bunch of docs. Stupid-simple indexing, just wanted to get more than one segment. I'm not sure having more than one segment is relevant, actually
3> shut down a follower
4> removed a few of the segment files. Not an entire segment, just 3 files at random from a single segment.
5> removed all the logs from the log directory.
6> tried to start the replica.
[jira] [Updated] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher
[ https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Drob updated SOLR-10006:
-----------------------------
    Attachment: SOLR-10006.patch

Patch that catches FileNotFoundException (FNFE) and NoSuchFileException (NSFE) and rethrows them as CorruptIndexException.

There might be an argument to be made for pushing the try/catch down to the various implementations of {{SegmentInfoFormat::read}}, but I don't think that will be maintainable going forward.

Another option is to catch all IOExceptions in {{SegmentInfos::readCommit}}, but that's a pretty wide net to cast and would mask any IndexTooOldExceptions, unless we specifically exclude them.
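The "catch all IOExceptions but exclude the too-old case" option can be sketched as follows. This is an illustration under stated assumptions, not code from the patch or from Lucene: `CatchAllWithExclusion`, `TooOldSketchException`, `CorruptSketchException`, `CommitReader`, and `readCommitSketch` are hypothetical names, with the two exception classes standing in for Lucene's `IndexFormatTooOldException` and `CorruptIndexException`.

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;

public class CatchAllWithExclusion {

    /** Hypothetical stand-in for Lucene's IndexFormatTooOldException (assumption). */
    static class TooOldSketchException extends IOException {
        TooOldSketchException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for Lucene's CorruptIndexException (assumption). */
    static class CorruptSketchException extends IOException {
        CorruptSketchException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical hook representing the work of reading a commit point. */
    interface CommitReader {
        void read() throws IOException;
    }

    /**
     * Wide-net variant: every IOException becomes a corruption error, except
     * the exception types that already carry a more specific meaning.
     */
    static void readCommitSketch(CommitReader reader) throws IOException {
        try {
            reader.read();
        } catch (TooOldSketchException | CorruptSketchException e) {
            throw e; // excluded from the wide net so the specific type is not masked
        } catch (IOException e) {
            throw new CorruptSketchException("Unexpected I/O error while reading commit", e);
        }
    }

    public static void main(String[] args) {
        CommitReader[] cases = {
            () -> { throw new NoSuchFileException("_0.si"); },
            () -> { throw new TooOldSketchException("index written by an unsupported version"); }
        };
        for (CommitReader c : cases) {
            try {
                readCommitSketch(c);
            } catch (IOException e) {
                System.out.println(e.getClass().getSimpleName());
            }
        }
    }
}
```

The order of the catch clauses carries the design choice: the exclusions must come before the general `IOException` handler, and every new specific exception type would need to be added to that list, which is the maintenance cost of the wide net.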