[ https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755409#action_12755409 ]
Uwe Schindler commented on SOLR-1423:
-------------------------------------

bq. I think the empty tokens is a bug and should be omitted in this patch.

The Javadocs say that it works like String.split(), which returns empty tokens but strips empty tokens at the end of the string. Solr provided this behavior before, and this patch preserves it. The code would get simpler if the Tokenizer generally stripped empty tokens, but that would be a backwards break. I would tend to just commit and then open another issue.

bq. Very nice! Can you open a separate ticket?

Will open one about Lucene's BaseTokenStreamTestCase.

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-1423
>                 URL: https://issues.apache.org/jira/browse/SOLR-1423
>             Project: Solr
>          Issue Type: Task
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Uwe Schindler
>            Assignee: Koji Sekiguchi
>             Fix For: 1.4
>
>         Attachments: SOLR-1423-FieldType.patch, SOLR-1423.patch, SOLR-1423.patch, SOLR-1423.patch
>
>
> Because of some backwards-compatibility problems (LUCENE-1906) we changed the CharStream/CharFilter API a little bit. Tokenizer now only has an input field of type java.io.Reader (as before the CharStream code). To correct offsets, it is now necessary to call the Tokenizer.correctOffset(int) method, which delegates to the CharStream (if input is a subclass of CharStream) and otherwise returns the offset uncorrected. Normally it is enough to change all occurrences of input.correctOffset() to this.correctOffset() in Tokenizers. It should also be checked whether custom Tokenizers in Solr correct their offsets.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
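The delegation pattern described in the issue (correctOffset returning a corrected offset only when the input Reader is a CharStream) can be sketched in plain Java. This is a minimal, self-contained illustration, not the actual Lucene 2.9 classes: the names CharStream and Tokenizer.correctOffset come from the issue text, while ShiftedCharStream and its fixed-shift behavior are hypothetical stand-ins for a real CharFilter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Sketch of a Reader that can map output offsets back to original-input offsets. */
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

/** Hypothetical CharStream pretending an upstream filter removed `removed` leading chars. */
class ShiftedCharStream extends CharStream {
    private final Reader in;
    private final int removed;

    ShiftedCharStream(Reader in, int removed) {
        this.in = in;
        this.removed = removed;
    }

    @Override
    public int correctOffset(int currentOff) {
        return currentOff + removed; // shift back to original-text coordinates
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return in.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

/** Sketch of the new Tokenizer base: input is a plain java.io.Reader. */
abstract class Tokenizer {
    protected Reader input;

    protected Tokenizer(Reader input) {
        this.input = input;
    }

    /** Delegates to the CharStream if input is one, else returns the offset unchanged. */
    protected final int correctOffset(int currentOff) {
        return (input instanceof CharStream)
                ? ((CharStream) input).correctOffset(currentOff)
                : currentOff;
    }
}

public class CorrectOffsetDemo {
    public static void main(String[] args) {
        // Plain Reader input: this.correctOffset() is a no-op.
        Tokenizer plain = new Tokenizer(new StringReader("abc")) {};
        System.out.println(plain.correctOffset(2)); // prints 2

        // CharStream input that stripped 5 chars: offsets are corrected.
        Tokenizer filtered =
                new Tokenizer(new ShiftedCharStream(new StringReader("abc"), 5)) {};
        System.out.println(filtered.correctOffset(2)); // prints 7
    }
}
```

This is why the migration is usually just s/input.correctOffset(...)/this.correctOffset(...)/: the base-class method performs the instanceof check, so Tokenizers no longer need to know whether their input was wrapped in a CharFilter.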