[
https://issues.apache.org/jira/browse/SOLR-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Ingersoll updated SOLR-330:
---------------------------------
Attachment: SOLR-330.patch
First draft of a patch that updates the various TokenFilters, etc. in Solr to
use the new Lucene reuse API. Notes on implementation below:
Also cleans up some of the javadocs in various files.
Added a test for the Porter stemmer.
Cleaned up some string literals to be constants so that they can be safely
referred to in the tests.
In the PatternTokenFilter, it would be cool if there were a way to operate
directly on the char array, but I don't see that the Pattern/Matcher API
supports it. The same goes for the PhoneticTokenFilter.
I'm not sure yet whether BufferedTokenStream (and its subclasses) can take
advantage of reuse, so I have left them alone for now, other than some minor
doc fixes. I will think about this some more.
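For reference, the reuse pattern these changes target looks roughly like the sketch below. The Token and stream classes here are simplified stand-ins I wrote for illustration; the names (WordStream, setTermBuffer) and signatures are assumptions, not the real Lucene API.

```java
// Simplified stand-ins for org.apache.lucene.analysis classes; names and
// signatures here are illustrative assumptions, not the real Lucene API.
class Token {
  char[] termBuffer = new char[16];
  int termLength;

  // Copy s into the reusable char buffer, growing it only when needed.
  void setTermBuffer(String s) {
    if (s.length() > termBuffer.length) termBuffer = new char[s.length()];
    s.getChars(0, s.length(), termBuffer, 0);
    termLength = s.length();
  }

  String term() { return new String(termBuffer, 0, termLength); }
}

class WordStream {
  private final String[] words;
  private int i = 0;
  WordStream(String... words) { this.words = words; }

  // Reuse contract: refill the caller's Token and hand the same instance
  // back, instead of allocating a new Token for every term.
  Token next(Token reusable) {
    if (i >= words.length) return null; // end of stream
    reusable.setTermBuffer(words[i++]);
    return reusable;
  }
}

public class ReuseDemo {
  public static void main(String[] args) {
    WordStream ts = new WordStream("foo", "bar");
    Token t = new Token(); // one Token for the whole stream
    StringBuilder out = new StringBuilder();
    for (Token tok = ts.next(t); tok != null; tok = ts.next(t)) {
      out.append(tok.term()).append(' ');
    }
    System.out.println(out.toString().trim()); // prints: foo bar
  }
}
```

The point is that consumers own the Token and producers only refill its buffer, which is what makes reuse unsafe for streams that hold on to previously returned Tokens (the BufferedTokenStream question above).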
In RemoveDuplicatesTF, I only converted to using termBuffer, not Token reuse.
I removed the "IN" and "OUT" loop labels, as I don't see what functionality
they provide.
Added an ArraysUtils class and test to provide a bit more functionality than
java.util.Arrays offers for comparing two char arrays. This could be
expanded at some point to cover other primitive comparisons.
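For what it's worth, the kind of helper I have in mind is roughly the following; the class name, method name, and signature here are illustrative, not necessarily what the attached patch uses.

```java
// Illustrative sketch of a char[] range comparison helper along the lines
// of the ArraysUtils class described above; names and signature are
// assumptions, not necessarily the patch's actual API.
public final class ArraysUtilsSketch {
  private ArraysUtilsSketch() {}

  // True if left[offset..offset+length) matches right[0..length).
  public static boolean equals(char[] left, int offset, char[] right, int length) {
    if (left == null || right == null) return false;
    if (offset < 0 || length < 0) return false;
    if (offset + length > left.length || length > right.length) return false;
    for (int i = 0; i < length; i++) {
      if (left[offset + i] != right[i]) return false;
    }
    return true;
  }
}
```

This avoids materializing a String from the termBuffer just to compare it, which is the whole point of the char[] API.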
My understanding of the new reusableTokenStream is that we can't use it in
the SolrAnalyzer.
On the TrimFilter, it is not clear to me that there would ever be a token
that is all whitespace. However, since the test handles it, I wonder why a
Token of " ", when updateOffsets is on, reports the offsets as the end
and not the start. Just a minor nit, but it seems like the start/end offsets
should be 0, not the end of the token.
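To make the nit concrete, here is a sketch of the offset math I would expect. It collapses an all-whitespace token to its start offset (the behavior I'm arguing for), so it is deliberately not what the current TrimFilter does; the class and method names are mine, for illustration only.

```java
// Sketch of TrimFilter-style offset adjustment. For an all-whitespace
// token this collapses both offsets to the token's start, which is the
// behavior argued for above, not the filter's current behavior.
public final class TrimOffsets {
  private TrimOffsets() {}

  // Returns {newStart, newEnd} after trimming whitespace from buf[0..len).
  // startOff/endOff are the token's original character offsets.
  public static int[] trim(char[] buf, int len, int startOff, int endOff) {
    int s = 0, e = len;
    while (s < e && Character.isWhitespace(buf[s])) s++;
    while (e > s && Character.isWhitespace(buf[e - 1])) e--;
    if (s == e) {
      // all-whitespace token: collapse to the start, per the nit above
      return new int[] { startOff, startOff };
    }
    return new int[] { startOff + s, endOff - (len - e) };
  }
}
```

So for a token " " with original offsets 0..1, this yields start = end = 0 rather than start = end = 1.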
I'm not totally sure about the WordDelimiterFilter, as there is a fair amount
of new token creation. Also, I think the newTok() method doesn't set the
position increment based on the original position increment, so I added that.
I'm also not completely sure how to handle FieldType's DefaultAnalyzer.next().
It seems like it could reuse the token.
I'm also not sure why MultiValueTokenStream is duplicated in
HighlighterUtils and SolrHighlighter, so I left the highlighter TokenStreams
alone.
> Use new Lucene Token APIs (reuse and char[] buff)
> -------------------------------------------------
>
> Key: SOLR-330
> URL: https://issues.apache.org/jira/browse/SOLR-330
> Project: Solr
> Issue Type: Improvement
> Reporter: Yonik Seeley
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-330.patch
>
>
> Lucene is getting new Token APIs for better performance.
> - token reuse
> - char[] offset + len instead of String
> Requires a new version of lucene.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.