[
https://issues.apache.org/jira/browse/SOLR-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Ingersoll updated SOLR-330:
---------------------------------
Attachment: SOLR-330.patch
First draft of a patch that updates the various TokenFilters, etc. in Solr to
use the new Lucene reuse API. Notes on implementation below:
Also cleans up some of the javadocs in various files.
Added a test for the Porter stemmer.
Cleaned up some string literals to be constants so that they can be safely
referred to in the tests.
In the PatternTokenFilter, it would be cool if there were a way to operate
directly on the char array, but I don't see that the Pattern/Matcher API
supports it. The same goes for the PhoneticTokenFilter.
I'm not sure yet whether BufferedTokenStream (and its subclasses) can take
advantage of reuse, so I have left them alone for now, other than some minor
doc fixes. I will think about this some more.
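For reference, the reuse pattern these changes target looks roughly like the sketch below. The Token and stream classes here are simplified stand-ins I wrote for illustration; the names (WordStream, setTermBuffer) and signatures are assumptions, not the real Lucene API.

```java
// Simplified stand-ins for org.apache.lucene.analysis classes; names and
// signatures here are illustrative assumptions, not the real Lucene API.
class Token {
  char[] termBuffer = new char[16];
  int termLength;

  // Copy s into the reusable char buffer, growing it only when needed.
  void setTermBuffer(String s) {
    if (s.length() > termBuffer.length) termBuffer = new char[s.length()];
    s.getChars(0, s.length(), termBuffer, 0);
    termLength = s.length();
  }

  String term() { return new String(termBuffer, 0, termLength); }
}

class WordStream {
  private final String[] words;
  private int i = 0;
  WordStream(String... words) { this.words = words; }

  // Reuse contract: refill the caller's Token and hand the same instance
  // back, instead of allocating a new Token for every term.
  Token next(Token reusable) {
    if (i >= words.length) return null; // end of stream
    reusable.setTermBuffer(words[i++]);
    return reusable;
  }
}

public class ReuseDemo {
  public static void main(String[] args) {
    WordStream ts = new WordStream("foo", "bar");
    Token t = new Token(); // one Token for the whole stream
    StringBuilder out = new StringBuilder();
    for (Token tok = ts.next(t); tok != null; tok = ts.next(t)) {
      out.append(tok.term()).append(' ');
    }
    System.out.println(out.toString().trim()); // prints: foo bar
  }
}
```

The point is that consumers own the Token and producers only refill its buffer, which is what makes reuse unsafe for streams that hold on to previously returned Tokens (the BufferedTokenStream question above).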
In RemoveDuplicatesTF, I only converted to using termBuffer, not Token reuse.
I removed the "IN" and "OUT" loop labels, as I don't see what functionality
they provide.
Added an ArraysUtils class and test to provide a bit more functionality than
java.util.Arrays offers for comparing two char arrays. This could be
expanded at some point to cover other primitive comparisons.
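For what it's worth, the kind of helper I have in mind is roughly the following; the class name, method name, and signature here are illustrative, not necessarily what the attached patch uses.

```java
// Illustrative sketch of a char[] range comparison helper along the lines
// of the ArraysUtils class described above; names and signature are
// assumptions, not necessarily the patch's actual API.
public final class ArraysUtilsSketch {
  private ArraysUtilsSketch() {}

  // True if left[offset..offset+length) matches right[0..length).
  public static boolean equals(char[] left, int offset, char[] right, int length) {
    if (left == null || right == null) return false;
    if (offset < 0 || length < 0) return false;
    if (offset + length > left.length || length > right.length) return false;
    for (int i = 0; i < length; i++) {
      if (left[offset + i] != right[i]) return false;
    }
    return true;
  }
}
```

This avoids materializing a String from the termBuffer just to compare it, which is the whole point of the char[] API.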
My understanding of the new reusableTokenStream is that we can't use it in
the SolrAnalyzer.
On the TrimFilter, it is not clear to me that there would ever be a token
that is all whitespace. However, since the test handles it, I wonder why a
Token of " ", when updateOffsets is on, reports the offsets as the end
and not the start. Just a minor nit, but it seems like the start/end offsets
should be 0, not the end of the token.
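To make the nit concrete, here is a sketch of the offset math I would expect. It collapses an all-whitespace token to its start offset (the behavior I'm arguing for), so it is deliberately not what the current TrimFilter does; the class and method names are mine, for illustration only.

```java
// Sketch of TrimFilter-style offset adjustment. For an all-whitespace
// token this collapses both offsets to the token's start, which is the
// behavior argued for above, not the filter's current behavior.
public final class TrimOffsets {
  private TrimOffsets() {}

  // Returns {newStart, newEnd} after trimming whitespace from buf[0..len).
  // startOff/endOff are the token's original character offsets.
  public static int[] trim(char[] buf, int len, int startOff, int endOff) {
    int s = 0, e = len;
    while (s < e && Character.isWhitespace(buf[s])) s++;
    while (e > s && Character.isWhitespace(buf[e - 1])) e--;
    if (s == e) {
      // all-whitespace token: collapse to the start, per the nit above
      return new int[] { startOff, startOff };
    }
    return new int[] { startOff + s, endOff - (len - e) };
  }
}
```

So for a token " " with original offsets 0..1, this yields start = end = 0 rather than start = end = 1.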
I'm not totally sure about the WordDelimiterFilter, as there is a fair amount
of new token creation. Also, I think the newTok() method doesn't set the
position increment based on the original position increment, so I added that.
I'm also not completely sure how to handle FieldType's DefaultAnalyzer.next().
It seems like it could reuse the token.
I'm also not sure why MultiValueTokenStream is duplicated in
HighlighterUtils and SolrHighlighter, so I left the highlighter TokenStreams
alone.
> Use new Lucene Token APIs (reuse and char[] buff)
> -------------------------------------------------
>
> Key: SOLR-330
> URL: https://issues.apache.org/jira/browse/SOLR-330
> Project: Solr
> Issue Type: Improvement
> Reporter: Yonik Seeley
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-330.patch
>
>
> Lucene is getting new Token APIs for better performance.
> - token reuse
> - char[] offset + len instead of String
> Requires a new version of lucene.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.