[ https://issues.apache.org/jira/browse/SOLR-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated SOLR-1423:
--------------------------------

    Attachment: SOLR-1423.patch

This is a complete, more efficient rewrite of the whole Tokenizer (I would 
like to put this into Lucene contrib, too!) using the new TokenStream API.

When going through the code, I realized the following: this Tokenizer can 
return empty tokens; it only filters empty tokens in split() mode. Is this 
expected? If empty tokens should be omitted, the if (matcher.find()) should be 
replaced by while (matcher.find()) with an if (match.length() == 0) continue; 
inside the loop (see the sketch below). The strange logic that omits the empty 
token at the end becomes very simple after this change.
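
A minimal sketch of the resulting loop on the new TokenStream API ("group" 
mode only; the class name and the readFully() helper are made up for 
illustration, this is not the attached patch):

import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SketchPatternTokenizer extends Tokenizer {
  private final TermAttribute termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  private final OffsetAttribute offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
  private final Matcher matcher;
  private final int group;

  public SketchPatternTokenizer(Reader input, Pattern pattern, int group) throws IOException {
    super(input);
    // buffer the whole input, as the factory already does today
    this.matcher = pattern.matcher(readFully(input));
    this.group = group;
  }

  public boolean incrementToken() {
    // while instead of if: keep scanning so that empty matches are skipped
    while (matcher.find()) {
      final String match = matcher.group(group);
      if (match == null || match.length() == 0) continue; // never emit an empty token
      termAtt.setTermBuffer(match);
      offsetAtt.setOffset(correctOffset(matcher.start(group)),
                          correctOffset(matcher.end(group)));
      return true;
    }
    return false;
  }

  private static String readFully(Reader in) throws IOException {
    final StringBuffer sb = new StringBuffer();
    final char[] buf = new char[1024];
    for (int n; (n = in.read(buf)) != -1;) sb.append(buf, 0, n);
    return sb.toString();
  }
}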

This patch removes the split()/group() methods from the factory entirely, as 
they are not needed anymore. If this is a backwards break, replace them by 
unused dummies (e.g. initialize a Tokenizer and return the tokens' term texts; 
a sketch of that idea follows).
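
A sketch of the dummy idea, just draining term texts from the new tokenizer 
(the class and method names here are made up; the factory's original 
split()/group() signatures may differ):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

final class TermTextDrainer {
  /** A deprecated split()/group() dummy could construct the tokenizer
      and delegate here to return the collected term texts. */
  static List<String> termTexts(TokenStream ts) throws IOException {
    final List<String> terms = new ArrayList<String>();
    final TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      terms.add(termAtt.term());
    }
    ts.end();
    ts.close();
    return terms;
  }
}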

In my opinion, one should never index empty tokens...

A second thing: Lucene has a new BaseTokenStreamTestCase class for checking 
tokens without Token instances (which will no longer work when Lucene 3.0 
switches to Attributes only). Maybe you should update these tests and use 
assertAnalyzesTo from the new base class instead.
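
For example, a test against the new base class could look like this sketch 
(the WhitespaceAnalyzer is just a stand-in for the analyzer under test):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class SketchAnalysisTest extends BaseTokenStreamTestCase {
  public void testTerms() throws Exception {
    Analyzer a = new WhitespaceAnalyzer();
    // compares the emitted terms via the attribute API, no Token instances
    assertAnalyzesTo(a, "foo bar", new String[] { "foo", "bar" });
  }
}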

> Lucene 2.9 RC4 may need some changes in Solr Analyzers using CharStream & others
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-1423
>                 URL: https://issues.apache.org/jira/browse/SOLR-1423
>             Project: Solr
>          Issue Type: Task
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Uwe Schindler
>            Assignee: Koji Sekiguchi
>             Fix For: 1.4
>
>         Attachments: SOLR-1423-FieldType.patch, SOLR-1423.patch, SOLR-1423.patch
>
>
> Because of some backwards compatibility problems (LUCENE-1906) we changed the 
> CharStream/CharFilter API a little bit. Tokenizer now only has an input field 
> of type java.io.Reader (as before the CharStream code). To correct offsets, 
> it is now necessary to call the Tokenizer.correctOffset(int) method, which 
> delegates to the CharStream (if the input is a subclass of CharStream) and 
> otherwise returns the offset uncorrected. Normally it is enough to change all 
> occurrences of input.correctOffset() to this.correctOffset() in Tokenizers. 
> It should also be checked whether custom Tokenizers in Solr correct their 
> offsets.
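
A minimal before/after sketch of that change inside a hypothetical Tokenizer 
(offsetAtt, start and end are stand-in names):

// before (only worked while input was a CharStream):
// offsetAtt.setOffset(input.correctOffset(start), input.correctOffset(end));

// after (Tokenizer.correctOffset(int) delegates to the CharStream if
// present, else returns the offset unchanged):
offsetAtt.setOffset(correctOffset(start), correctOffset(end));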

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
