Re: Payloads, Tokenizers, and Filters. Oh My!

Tricia Williams Sun, 18 Nov 2007 14:11:23 -0800

I apologize for cross-posting, but I believe both Solr and Lucene users and developers should be concerned with this. I am not aware of a better way to reach both communities.


In this email I'm looking for comments on:


   * Do TokenFilters belong in the Solr code base at all?
   * How to deal with TokenFilters that add new Tokens to the stream?
   * How to patch TokenFilters and Tokenizers using the model of
     LUCENE-969 in the Solr code base and in Lucene contrib?

Earlier in this thread I identified that at least one TokenFilter is eating Payloads (WordDelimiterFilter).


Yonik pointed out:

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

And I responded:

I suppose that it is only fair to take this on a case by case basis. Maybe we will have to write new TokenFilters for each Tokenizer that uses Payloads (but I sure hope not!). Maybe we can build some optional configuration options into the TokenFilter constructor that guide their behavior with regard to Payloads. Maybe there is something stored in the TokenStream that dictates how the Payloads are handled by the TokenFilters. Maybe there is no case where identical payloads would not be created for new tokens and we can just change the TokenFilter to deal with payloads directly in a uniform way.

I thought it might be useful to figure out which existing TokenFilters need to know about Payloads. To this end I have taken an inventory of the TokenFilters out there. I think it is fair to categorize them by Add (A), Delete (D), Modify (M), Observe (O):


org.apache.solr.analysis.HyphenatedWordsFilter, DM
org.apache.solr.analysis.KeepWordFilter, D
org.apache.solr.analysis.LengthFilter, D
org.apache.solr.analysis.PatternReplaceFilter, M
org.apache.solr.analysis.PhoneticFilter, AM
org.apache.solr.analysis.RemoveDuplicatesTokenFilter, D
org.apache.solr.analysis.SynonymFilter, ADM
org.apache.solr.analysis.TrimFilter, M
org.apache.solr.analysis.WordDelimiterFilter, AM
org.apache.lucene.analysis.CachingTokenFilter, O
org.apache.lucene.analysis.ISOLatin1AccentFilter, M
org.apache.lucene.analysis.LengthFilter, D
org.apache.lucene.analysis.LowerCaseFilter, M
org.apache.lucene.analysis.PorterStemFilter, M
org.apache.lucene.analysis.StopFilter, D
org.apache.lucene.analysis.standard.StandardFilter, M
org.apache.lucene.analysis.br.BrazilianStemFilter, M
org.apache.lucene.analysis.cn.ChineseFilter, D
org.apache.lucene.analysis.de.GermanStemFilter, M
org.apache.lucene.analysis.el.GreekLowerCaseFilter, M
org.apache.lucene.analysis.fr.ElisionFilter, M
org.apache.lucene.analysis.fr.FrenchStemFilter, M
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter, AM
org.apache.lucene.analysis.ngram.NGramTokenFilter, AM
org.apache.lucene.analysis.nl.DutchStemFilter, M
org.apache.lucene.analysis.ru.RussianLowerCaseFilter, M
org.apache.lucene.analysis.ru.RussianStemFilter, M
org.apache.lucene.analysis.th.ThaiWordFilter, AM
org.apache.lucene.analysis.snowball.SnowballFilter, M

Some characteristics of Add (A), Delete (D), Modify (M), Observe (O)

Add: new Token() and buffer of Tokens to consider before addressing input.next()

Delete: loop ignoring tokens based on some criteria
Modify: new Token(), or use of Token set methods
Observe: rare (CachingTokenFilter is the only one in the inventory above)

The categories of TokenFilters that are affected by Payloads are add and modify. TokenFilters which only delete or observe return, by default, each Token fed through them intact, hence the Payload will remain intact.
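The distinction between the categories can be illustrated with a minimal sketch. SketchToken and FilterCategories below are hypothetical, simplified stand-ins written for this example, not the Lucene Token or TokenFilter classes; the payload is modeled as a plain byte array. The sketch assumes a delete filter hands surviving tokens through untouched, and contrasts two ways of modifying a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-in for a token; NOT the Lucene Token class.
class SketchToken {
    String term;
    byte[] payload; // null when the token carries no payload

    SketchToken(String term, byte[] payload) {
        this.term = term;
        this.payload = payload;
    }
}

class FilterCategories {
    // Delete: skip tokens that fail a criterion. Survivors are the same
    // objects that came in, so whatever payload they carried is untouched.
    static List<SketchToken> deleteShort(List<SketchToken> in, int minLength) {
        List<SketchToken> out = new ArrayList<SketchToken>();
        for (SketchToken t : in) {
            if (t.term.length() >= minLength) {
                out.add(t);
            }
        }
        return out;
    }

    // Modify by constructing a replacement token: the payload is lost
    // unless the filter copies it explicitly (here it does not).
    static SketchToken lowerCaseByCopy(SketchToken t) {
        return new SketchToken(t.term.toLowerCase(Locale.ROOT), null);
    }

    // Modify through a setter on the incoming token: the payload rides along.
    static SketchToken lowerCaseInPlace(SketchToken t) {
        t.term = t.term.toLowerCase(Locale.ROOT);
        return t;
    }
}
```

Under these assumptions only lowerCaseByCopy drops the payload, which is the behavior a filter shows when it builds its output with new Token() and never looks at the input's payload.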

Maybe the Lucene community has thought about this problem? I noticed that the org.apache.lucene.analysis TokenFilters in the modify category (there are none in the add category) refrain from using new Token(). That led me to the comment in the JavaDocs:

*NOTE:* As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in-place as the Token is processed. This provides substantially better indexing performance as it saves the GC cost of new'ing a Token and String for every term. The APIs that accept String termText are still available but a warning about the associated performance cost has been added (below). The |termText()| <http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#termText%28%29> method has been deprecated.

Tokenizers and filters should try to re-use a Token instance when possible for best performance, by implementing the |TokenStream.next(Token)| <http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/TokenStream.html#next%28org.apache.lucene.analysis.Token%29> API. Failing that, to create a new Token you should first use one of the constructors that starts with null text. Then you should call either |termBuffer()| <http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#termBuffer%28%29> or |resizeTermBuffer(int)| <http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#resizeTermBuffer%28int%29> to retrieve the Token's termBuffer. Fill in the characters of your term into this buffer, and finally call |setTermLength(int)| <http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#setTermLength%28int%29> to set the length of the term text. See LUCENE-969 <https://issues.apache.org/jira/browse/LUCENE-969> for details.
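The buffer-reuse pattern described in the note can be sketched roughly as follows. ReusableToken is a simplified stand-in that only mirrors the method names quoted above (termBuffer(), resizeTermBuffer(int), setTermLength(int)); it is not the real org.apache.lucene.analysis.Token, and termLength(), the payload field, and the InPlaceModify helper are assumptions made for illustration.

```java
// Simplified stand-in mirroring the method names quoted in the note;
// NOT the real org.apache.lucene.analysis.Token.
class ReusableToken {
    private char[] buffer = new char[8];
    private int length = 0;
    byte[] payload; // stand-in for the token's payload (assumption)

    char[] termBuffer() {
        return buffer;
    }

    // Grow the buffer to hold at least newSize chars, keeping existing content.
    char[] resizeTermBuffer(int newSize) {
        if (buffer.length < newSize) {
            char[] bigger = new char[newSize];
            System.arraycopy(buffer, 0, bigger, 0, length);
            buffer = bigger;
        }
        return buffer;
    }

    int termLength() {
        return length;
    }

    void setTermLength(int newLength) {
        length = newLength;
    }
}

class InPlaceModify {
    // Fill a token by writing into its buffer instead of allocating a String.
    static void fill(ReusableToken t, String text) {
        char[] buf = t.resizeTermBuffer(text.length());
        text.getChars(0, text.length(), buf, 0);
        t.setTermLength(text.length());
    }

    // A "modify" step: lower-case the term in place and return the same
    // token object, so every other field, including the payload, is kept.
    static ReusableToken lowerCase(ReusableToken t) {
        char[] buf = t.termBuffer();
        for (int i = 0; i < t.termLength(); i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return t;
    }
}
```

Because the filter step never constructs a second token, there is nothing to copy: the payload stays attached as a side effect of reusing the instance.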

The patch mentioned modifies the Tokenizers and TokenFilters in the Lucene core code base to abide by the suggestions made. This would mean that the TokenFilters in my modify category would have the default behavior of the Payload of the modified Token remaining intact. I would argue that when/if the Solr community starts using Lucene 2.3, a similar patch should be created for the TokenFilters there, but I wonder if the TokenFilters belong in Solr's domain at all. At some point the TokenFilters and Tokenizers in the contrib sections of Lucene should also be patched with the suggestions.

If this occurs then we only have to consider the add case. I don't think we can avoid looking at this on a case by case basis, but most of the add cases are providing alternate terms for the same position. In that case the payload would simply be copied to the new Token much like the Token's positionIncrement.
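A rough sketch of that add case, under stated assumptions: PosToken and AddAlternates are hypothetical stand-ins rather than Lucene classes, the alternates come from a plain map, each alternate is stacked on the original's position with a position increment of 0, and the payload is duplicated byte for byte. Whether identical payloads are actually the right semantics still depends on what the payload means, as Yonik noted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a token with a position increment and a payload.
class PosToken {
    final String term;
    final int positionIncrement;
    final byte[] payload; // null when there is no payload

    PosToken(String term, int positionIncrement, byte[] payload) {
        this.term = term;
        this.positionIncrement = positionIncrement;
        this.payload = payload;
    }
}

class AddAlternates {
    // Emit each input token unchanged, then emit its alternate terms at the
    // same position (increment 0), each carrying a copy of the payload.
    static List<PosToken> expand(List<PosToken> in, Map<String, String[]> alternates) {
        List<PosToken> out = new ArrayList<PosToken>();
        for (PosToken t : in) {
            out.add(t);
            String[] alts = alternates.get(t.term);
            if (alts == null) {
                continue;
            }
            for (String alt : alts) {
                byte[] copy = (t.payload == null) ? null : t.payload.clone();
                out.add(new PosToken(alt, 0, copy));
            }
        }
        return out;
    }
}
```

Cloning the array rather than sharing it is a design choice made here so that a later filter changing one token's payload cannot silently change its alternates.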


Thanks for your input,
Tricia
