[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF
[ https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806407#comment-15806407 ] Cao Manh Dat edited comment on SOLR-8292 at 1/7/17 1:59 AM: I think people kinda of misunderstanding this line {code} * @return The log record, or null if EOF {code} {{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is fully read to the end. While {{EOFException}} throw by TransactionLog.next() mean the file is corrupted. For example {code:title=TransactionLog.java|borderStyle=solid} codec.writeTag(JavaBinCodec.ARR, 3); codec.writeInt(UpdateLog.ADD | flags); // should just take one byte codec.writeLong(cmd.getVersion()); codec.writeSolrInputDocument(cmd.getSolrInputDocument()); {code} So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect that there are 3 more elements to be read. But if the file have only 2 elements ( because the file is corrupted/truncated ) it will throw an {{EOFException}}. FIY: I also write a test ( {{TestCloudRecovery.corruptedLogTest()}} ) to check that even if all the tlogs is corrupted/truncated, the collection can still become healthy after restart. So in my opinion SOLR-4116 is quite general, - if the system is restarted gracefully and the EOFException still be thrown so it's a bug. - If the system is restarted roughly ( kill -9 ) so it's not a bug. So we do not make sure that this is a bug or not without a way to re procedure that. was (Author: caomanhdat): I think people kinda of misunderstanding this line {code} * @return The log record, or null if EOF {code} {{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is fully read to the end. While {{EOFException}} throw by TransactionLog.next() mean the file is corrupted. For example {code:title=TransactionLog.java|borderStyle=solid} codec.writeTag(JavaBinCodec.ARR, 3); codec.writeInt(UpdateLog.ADD | flags); // should just take one byte codec.writeLong(cmd.getVersion()); codec.writeSolrInputDocument(cmd.getSolrInputDocument()); {code} So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect that there are 3 more elements to be read. But if the file have only 2 elements ( because the file is corrupted/truncated ) it will throw an {{EOFException}}. FIY: I also write a test ( {{TestCloudRecovery.corruptedLogTest()}} ) to check that even if all the tlogs is corrupted/truncated, the collection can still become healthy after restart. So in my opinion SOLR-4116 is quite general, - if the system is restarted gracefully and the EOFException still be thrown so it's a bug. - If the system is restarted roughly ( kill -9 ) so it's not a bug. > TransactionLog.next() does not honor contract and return null for EOF > - > > Key: SOLR-8292 > URL: https://issues.apache.org/jira/browse/SOLR-8292 > Project: Solr > Issue Type: Bug >Reporter: Erick Erickson >Assignee: Erick Erickson > Attachments: SOLR-8292.patch > > > This came to light in CDCR testing, which stresses this code a lot, there's a > stack trace showing this line (641 trunk) throwing an EOF exception: > o = codec.readVal(fis); > At first I thought to just wrap reading fis in a try/catch and return null, > but looking at the code a bit more I'm not so sure, that seems like it'd mask > what looks at first glance like a bug in the logic. > A few lines earlier (633-4) there's these lines: > // shouldn't currently happen - header and first record are currently written > at the same time > if (fis.position() >= fos.size()) { > Why are we comparing the the input file position against the size of the > output file? Maybe because the 'i' key is right next to the 'o' key? The > comment hints that it's checking for the ability to read the first record in > input stream along with the header. And perhaps there's a different issue > here because the expectation clearly is that the first record should be there > if the header is. > So what's the right thing to do? Wrap in a try/catch and return null for EOF? > Change the test? Do both? > I can take care of either, but wanted a clue whether the comparison of fis to > fos is intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF
[ https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806407#comment-15806407 ] Cao Manh Dat edited comment on SOLR-8292 at 1/7/17 1:33 AM: I think people kinda of misunderstanding this line {code} * @return The log record, or null if EOF {code} {{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is fully read to the end. While {{EOFException}} throw by TransactionLog.next() mean the file is corrupted. For example {code:title=TransactionLog.java|borderStyle=solid} codec.writeTag(JavaBinCodec.ARR, 3); codec.writeInt(UpdateLog.ADD | flags); // should just take one byte codec.writeLong(cmd.getVersion()); codec.writeSolrInputDocument(cmd.getSolrInputDocument()); {code} So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect that there are 3 more elements to be read. But if the file have only 2 elements ( because the file is corrupted/truncated ) it will throw an {{EOFException}}. FIY: I also write a test ( {{TestCloudRecovery.corruptedLogTest()}} ) to check that even if all the tlogs is corrupted/truncated, the collection can still become healthy after restart. So in my opinion SOLR-4116 is quite general, - if the system is restarted gracefully and the EOFException still be thrown so it's a bug. - If the system is restarted roughly ( kill -9 ) so it's not a bug. was (Author: caomanhdat): I think people kinda of misunderstanding this line {code} * @return The log record, or null if EOF {code} {{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is fully read to the end. While {{EOFException}} throw by TransactionLog.next() mean the file is corrupted. For example {code:title=TransactionLog.java|borderStyle=solid} codec.writeTag(JavaBinCodec.ARR, 3); codec.writeInt(UpdateLog.ADD | flags); // should just take one byte codec.writeLong(cmd.getVersion()); codec.writeSolrInputDocument(cmd.getSolrInputDocument()); {code} So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect that there are 3 more elements to be read. But if the file have only 2 elements ( because the file is corrupted ) it will throw an {{EOFException}}. > TransactionLog.next() does not honor contract and return null for EOF > - > > Key: SOLR-8292 > URL: https://issues.apache.org/jira/browse/SOLR-8292 > Project: Solr > Issue Type: Bug >Reporter: Erick Erickson >Assignee: Erick Erickson > Attachments: SOLR-8292.patch > > > This came to light in CDCR testing, which stresses this code a lot, there's a > stack trace showing this line (641 trunk) throwing an EOF exception: > o = codec.readVal(fis); > At first I thought to just wrap reading fis in a try/catch and return null, > but looking at the code a bit more I'm not so sure, that seems like it'd mask > what looks at first glance like a bug in the logic. > A few lines earlier (633-4) there's these lines: > // shouldn't currently happen - header and first record are currently written > at the same time > if (fis.position() >= fos.size()) { > Why are we comparing the the input file position against the size of the > output file? Maybe because the 'i' key is right next to the 'o' key? The > comment hints that it's checking for the ability to read the first record in > input stream along with the header. And perhaps there's a different issue > here because the expectation clearly is that the first record should be there > if the header is. > So what's the right thing to do? Wrap in a try/catch and return null for EOF? > Change the test? Do both? > I can take care of either, but wanted a clue whether the comparison of fis to > fos is intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF
[ https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15803900#comment-15803900 ] Cao Manh Dat edited comment on SOLR-8292 at 1/6/17 9:19 AM: Hi Erick, I'm not familiar with CDCR code much. But I will give it a try today. Do we have any test that re procedure this error? Here are the log error on SOLR-4116, {code} 2012-11-28 11:32:33,086 WARN [solr.update.UpdateLog] - [recoveryExecutor-8-thread-1] - : Starting log replay tlog{file=/opt/solr/cores/openindex_e/data/tlog/tlog.028 refcount=2} active=false starting pos=0 {code} If your log also have flag {{active=false starting pos=0}} then I think both issues can be related. was (Author: caomanhdat): Hi Erick, I'm not familiar with CDCR code much. But I will give it a try today. Do we have any test that re procedure this error? > TransactionLog.next() does not honor contract and return null for EOF > - > > Key: SOLR-8292 > URL: https://issues.apache.org/jira/browse/SOLR-8292 > Project: Solr > Issue Type: Bug >Reporter: Erick Erickson >Assignee: Erick Erickson > Attachments: SOLR-8292.patch > > > This came to light in CDCR testing, which stresses this code a lot, there's a > stack trace showing this line (641 trunk) throwing an EOF exception: > o = codec.readVal(fis); > At first I thought to just wrap reading fis in a try/catch and return null, > but looking at the code a bit more I'm not so sure, that seems like it'd mask > what looks at first glance like a bug in the logic. > A few lines earlier (633-4) there's these lines: > // shouldn't currently happen - header and first record are currently written > at the same time > if (fis.position() >= fos.size()) { > Why are we comparing the the input file position against the size of the > output file? Maybe because the 'i' key is right next to the 'o' key? The > comment hints that it's checking for the ability to read the first record in > input stream along with the header. And perhaps there's a different issue > here because the expectation clearly is that the first record should be there > if the header is. > So what's the right thing to do? Wrap in a try/catch and return null for EOF? > Change the test? Do both? > I can take care of either, but wanted a clue whether the comparison of fis to > fos is intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF
[ https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005479#comment-15005479 ] Erick Erickson edited comment on SOLR-8292 at 11/14/15 5:03 PM: Bah! fis is an InputStream. I'll try to add a bit of logging and trigger this again to see if the size comparison is even relevant. was (Author: erickerickson): Bah! fis is a FastInputStream, not FileInputStream. I'll try to add a bit of logging and trigger this again to see if the size comparison is even relevant. > TransactionLog.next() does not honor contract and return null for EOF > - > > Key: SOLR-8292 > URL: https://issues.apache.org/jira/browse/SOLR-8292 > Project: Solr > Issue Type: Bug >Reporter: Erick Erickson > > This came to light in CDCR testing, which stresses this code a lot, there's a > stack trace showing this line (641 trunk) throwing an EOF exception: > o = codec.readVal(fis); > At first I thought to just wrap reading fis in a try/catch and return null, > but looking at the code a bit more I'm not so sure, that seems like it'd mask > what looks at first glance like a bug in the logic. > A few lines earlier (633-4) there's these lines: > // shouldn't currently happen - header and first record are currently written > at the same time > if (fis.position() >= fos.size()) { > Why are we comparing the the input file position against the size of the > output file? Maybe because the 'i' key is right next to the 'o' key? The > comment hints that it's checking for the ability to read the first record in > input stream along with the header. And perhaps there's a different issue > here because the expectation clearly is that the first record should be there > if the header is. > So what's the right thing to do? Wrap in a try/catch and return null for EOF? > Change the test? Do both? > I can take care of either, but wanted a clue whether the comparison of fis to > fos is intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org