[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2017-01-06 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806407#comment-15806407
 ] 

Cao Manh Dat edited comment on SOLR-8292 at 1/7/17 1:59 AM:


I think people kinda of misunderstanding this line
{code}
* @return The log record, or null if EOF
{code} 
{{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is 
fully read to the end. While {{EOFException}} throw by TransactionLog.next() 
mean the file is corrupted. 

For example 
{code:title=TransactionLog.java|borderStyle=solid}
codec.writeTag(JavaBinCodec.ARR, 3);
codec.writeInt(UpdateLog.ADD | flags);  // should just take one byte
codec.writeLong(cmd.getVersion());
codec.writeSolrInputDocument(cmd.getSolrInputDocument());
{code}

So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect 
that there are 3 more elements to be read. But if the file have only 2 elements 
( because the file is corrupted/truncated ) it will throw an {{EOFException}}.

FIY: I also write a test ( {{TestCloudRecovery.corruptedLogTest()}} ) to check 
that even if all the tlogs is corrupted/truncated, the collection can still 
become healthy after restart.

So in my opinion SOLR-4116 is quite general, 
- if the system is restarted gracefully and the EOFException still be thrown so 
it's a bug. 
- If the system is restarted roughly ( kill -9 ) so it's not a bug.

So we do not make sure that this is a bug or not without a way to re procedure 
that.


was (Author: caomanhdat):
I think people kinda of misunderstanding this line
{code}
* @return The log record, or null if EOF
{code} 
{{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is 
fully read to the end. While {{EOFException}} throw by TransactionLog.next() 
mean the file is corrupted. 

For example 
{code:title=TransactionLog.java|borderStyle=solid}
codec.writeTag(JavaBinCodec.ARR, 3);
codec.writeInt(UpdateLog.ADD | flags);  // should just take one byte
codec.writeLong(cmd.getVersion());
codec.writeSolrInputDocument(cmd.getSolrInputDocument());
{code}

So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect 
that there are 3 more elements to be read. But if the file have only 2 elements 
( because the file is corrupted/truncated ) it will throw an {{EOFException}}.

FIY: I also write a test ( {{TestCloudRecovery.corruptedLogTest()}} ) to check 
that even if all the tlogs is corrupted/truncated, the collection can still 
become healthy after restart.

So in my opinion SOLR-4116 is quite general, 
- if the system is restarted gracefully and the EOFException still be thrown so 
it's a bug. 
- If the system is restarted roughly ( kill -9 ) so it's not a bug.

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Erick Erickson
> Attachments: SOLR-8292.patch
>
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2017-01-06 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806407#comment-15806407
 ] 

Cao Manh Dat edited comment on SOLR-8292 at 1/7/17 1:33 AM:


I think people kinda of misunderstanding this line
{code}
* @return The log record, or null if EOF
{code} 
{{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is 
fully read to the end. While {{EOFException}} throw by TransactionLog.next() 
mean the file is corrupted. 

For example 
{code:title=TransactionLog.java|borderStyle=solid}
codec.writeTag(JavaBinCodec.ARR, 3);
codec.writeInt(UpdateLog.ADD | flags);  // should just take one byte
codec.writeLong(cmd.getVersion());
codec.writeSolrInputDocument(cmd.getSolrInputDocument());
{code}

So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect 
that there are 3 more elements to be read. But if the file have only 2 elements 
( because the file is corrupted/truncated ) it will throw an {{EOFException}}.

FIY: I also write a test ( {{TestCloudRecovery.corruptedLogTest()}} ) to check 
that even if all the tlogs is corrupted/truncated, the collection can still 
become healthy after restart.

So in my opinion SOLR-4116 is quite general, 
- if the system is restarted gracefully and the EOFException still be thrown so 
it's a bug. 
- If the system is restarted roughly ( kill -9 ) so it's not a bug.


was (Author: caomanhdat):
I think people kinda of misunderstanding this line
{code}
* @return The log record, or null if EOF
{code} 
{{EOF}} here is not related to {{EOFException}}, {{EOF}} mean when the file is 
fully read to the end. While {{EOFException}} throw by TransactionLog.next() 
mean the file is corrupted. 

For example 
{code:title=TransactionLog.java|borderStyle=solid}
codec.writeTag(JavaBinCodec.ARR, 3);
codec.writeInt(UpdateLog.ADD | flags);  // should just take one byte
codec.writeLong(cmd.getVersion());
codec.writeSolrInputDocument(cmd.getSolrInputDocument());
{code}

So when {{LogReader}} read to the tag {{JavaBinCodec.ARR = 3}}, it will expect 
that there are 3 more elements to be read. But if the file have only 2 elements 
( because the file is corrupted ) it will throw an {{EOFException}}.

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Erick Erickson
> Attachments: SOLR-8292.patch
>
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2017-01-06 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15803900#comment-15803900
 ] 

Cao Manh Dat edited comment on SOLR-8292 at 1/6/17 9:19 AM:


Hi Erick, I'm not familiar with CDCR code much. But I will give it a try today. 
Do we have any test that re procedure this error?

Here are the log error on SOLR-4116, 
{code}
2012-11-28 11:32:33,086 WARN [solr.update.UpdateLog] - 
[recoveryExecutor-8-thread-1] - : Starting log replay 
tlog{file=/opt/solr/cores/openindex_e/data/tlog/tlog.028 
refcount=2} active=false starting pos=0
{code}
If your log also have flag {{active=false starting pos=0}} then I think both 
issues can be related.


was (Author: caomanhdat):
Hi Erick, I'm not familiar with CDCR code much. But I will give it a try today. 
Do we have any test that re procedure this error?

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Erick Erickson
> Attachments: SOLR-8292.patch
>
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2015-11-14 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005479#comment-15005479
 ] 

Erick Erickson edited comment on SOLR-8292 at 11/14/15 5:03 PM:


Bah! fis is an InputStream. I'll try to add a bit of logging and trigger this 
again to see if the size comparison is even relevant.


was (Author: erickerickson):
Bah! fis is a FastInputStream, not FileInputStream. I'll try to add a bit of 
logging and trigger this again to see if the size comparison is even relevant.

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org