[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

Uwe Schindler (JIRA) Wed, 16 Dec 2009 06:05:46 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791370#action_12791370
 ]


Uwe Schindler commented on SOLR-1662:
-------------------------------------

Just the short desription from the API side in Lucene:
Lucene's documentation of TokenStream.next() says: "The returned Token is a 
"full private copy" (not re-used across calls to next())". 
AB_AAB_Stream.process() duplicates the token by just putting it uncloned into 
the outQueue. As the consumer of the BufferedTokenStream assumes that the Token 
is private it is allowed to change it - and by that it also changes the token 
in the outQueue. If you e.g. put another TokenFilter in fromt of this 
AB_AAB_Stream, and modify the token there it would break.
In my opinion, the responsibility to clone is in AB_AAB_Stream, 
BufferedTokenStream will never return the same token twice by itsself. So its a 
bug in the test. But Robert told me that e.g. RemoveDuplicates has a similar 
problem.
The general contract for writing such streams is: whenever you return a Token 
from next(), never put it somewhere else uncloned, because the caller can 
change it.

The fix is to do: write((Token) t.clone());

> BufferedTokenStream incorrect cloning
> -------------------------------------
>
>                 Key: SOLR-1662
>                 URL: https://issues.apache.org/jira/browse/SOLR-1662
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
>     StringBuilder out = new StringBuilder();
>     TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
>     // extra safety to enforce, that the state is not preserved and also
>     // assign bogus values
>     in.clearAttributes();
>     termAtt.setTermBuffer("bogusTerm");
>     while (in.incrementToken()) {
>       if (out.length() > 0)
>         out.append(' ');
>       out.append(termAtt.term());
>       in.clearAttributes();
>       termAtt.setTermBuffer("bogusTerm");
>     }
>     in.close();
>     return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

Reply via email to