Re: Payloads, Tokenizers, and Filters. Oh My!

2007-11-20 Thread Chris Hostetter


: I apologize for cross-posting but  I believe both Solr and Lucene users and
: developers should be concerned with this.  I am not aware of a better way to
: reach both communities.

some of these questions strike me as being largely unrelated.  if 
anyone wishes to follow up on them further, let's do it in (new) separate 
threads for each topic, on the specific list appropriate to the topic...

:* Do TokenFilters belong in the Solr code base at all?

Yes, in so much as any java code belongs in the Solr code base (or the 
nutch code base for that matter).  They are separate projects with 
separate communities and separate needs -- that doesn't mean that there 
isn't code in Solr which could be useful to the broader community of 
lucene-java; in that case the appropriate course of action is to open a 
LUCENE issue to promote the code up into lucene-java, and a dependent 
issue in SOLR to deprecate the current code and use the newer code 
instead.

as some people may be aware, there was a discussion about this sort of 
thing at ApacheCon during the Lucene BOF -- some reasons this doesn't 
happen as often as it seems like it should are:
  * the code may have subtle dependency tendrils that make it hard to 
refactor from one code base to the other.
  * the tests are frequently harder to promote than the code (in the 
case of most Solr tests that use the TestHarness, it's probably easier 
to write new tests from scratch)
  * when promoting the code, it's the best time to consider whether the 
existing API is really the best API before a lot of new people start 
using it (compare Solr's FunctionQuery and Lucene's CustomScoreQuery 
for example)
  * someone needs to care enough to follow through on the promotion.

...further discussion is best suited for java-dev since the topic is not 
Solr specific (there's a lot of Nutch code out there that people have asked 
about promoting as well)

:* How to deal with TokenFilters that add new Tokens to the stream?

This is specifically regarding Payloads, yes?  also a pretty clear cut 
java-dev discussion (and one possibly already being discussed in the 
monolithic Payload API thread i haven't started reading yet).  
lucene-java sets the API and the semantics ... Solr code will follow them.

:* How to patch TokenFilters and Tokenizers using the model of
:  LUCENE-969 in the Solr code base and in Lucene contrib?

open SOLR issues containing patches for any Solr code that needs to be 
changed, and LUCENE issues containing patches for contrib code that needs 
to be changed.

: I thought it might be useful to figure out which existing TokenFilters need to
: know about Payloads.  To this end I have taken an inventory of the
: TokenFilters out there.  I think it is fair to categorize them by Add (A),
: Delete (D), Modify (M), Observe (O):

again: this is a straightforward lucene-java question ... once the 
semantics have been worked out, then there can be a Solr specific 
discussion about following them.

(which is not to say that the Solr classes/use-cases shouldn't be 
considered in the discussion, just that java-dev is the right place to 
have the conversation)




-Hoss

Re: Payloads, Tokenizers, and Filters. Oh My!

2007-11-18 Thread Tricia Williams

I apologize for cross-posting but  I believe both Solr and Lucene users 
and developers should be concerned with this.  I am not aware of a 
better way to reach both communities.


In this email I'm looking for comments on:

   * Do TokenFilters belong in the Solr code base at all?
   * How to deal with TokenFilters that add new Tokens to the stream?
   * How to patch TokenFilters and Tokenizers using the model of
 LUCENE-969 in the Solr code base and in Lucene contrib?

Earlier in this thread I identified that at least one TokenFilter is 
eating Payloads (WordDelimiterFilter).


Yonik pointed out:

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

And I responded: 
I suppose that it is only fair to take this on a case by case basis.  
Maybe we will have to write new TokenFilters for each Tokenizer that 
uses Payloads (but I sure hope not!).  Maybe we can build some 
optional configuration options into the TokenFilter constructor that 
guide their behavior with regard to Payloads.  Maybe there is 
something stored in the TokenStream that dictates how the Payloads are 
handled by the TokenFilters.  Maybe there is no case where identical 
payloads would not be created for new tokens and we can just change 
the TokenFilter to deal with payloads directly in a uniform way. 
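
As a rough illustration of the "optional configuration options" idea above, here is a minimal sketch in plain Java.  It does not use the Lucene API: Tok, PayloadPolicy, and PayloadSplitSketch are hypothetical stand-ins invented for this example, and the hyphen-splitting rule is only a placeholder for whatever a real filter such as WordDelimiterFilter does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: term text plus an optional payload.
// This is NOT Lucene's Token class.
final class Tok {
    final String term;
    final byte[] payload; // null means "no payload"

    Tok(String term, byte[] payload) {
        this.term = term;
        this.payload = payload;
    }
}

// Hypothetical policy for tokens that a filter creates from one input token.
enum PayloadPolicy { COPY_TO_ALL, FIRST_ONLY, DROP }

final class PayloadSplitSketch {
    // An "add"-style filter: splits each term on '-' and applies the
    // configured payload policy to every token it creates.
    static List<Tok> splitOnHyphen(List<Tok> input, PayloadPolicy policy) {
        List<Tok> out = new ArrayList<>();
        for (Tok in : input) {
            boolean first = true;
            for (String part : in.term.split("-")) {
                if (part.isEmpty()) {
                    continue; // skip empty pieces such as the middle of "a--b"
                }
                byte[] payload = null;
                boolean keep = policy == PayloadPolicy.COPY_TO_ALL
                        || (policy == PayloadPolicy.FIRST_ONLY && first);
                if (keep && in.payload != null) {
                    payload = in.payload.clone(); // each new token gets its own copy
                }
                out.add(new Tok(part, payload));
                first = false;
            }
        }
        return out;
    }
}
```

Which policy is right still depends on what the payload means: a per-word boost may make sense on every part, while a payload tied to the original position may only make sense on the first.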


I thought it might be useful to figure out which existing TokenFilters 
need to know about Payloads.  To this end I have taken an inventory of 
the TokenFilters out there.  I think it is fair to categorize them by 
Add (A), Delete (D), Modify (M), Observe (O):


org.apache.solr.analysis.HyphenatedWordsFilter, DM
org.apache.solr.analysis.KeepWordFilter, D
org.apache.solr.analysis.LengthFilter, D
org.apache.solr.analysis.PatternReplaceFilter, M
org.apache.solr.analysis.PhoneticFilter, AM
org.apache.solr.analysis.RemoveDuplicatesTokenFilter, D
org.apache.solr.analysis.SynonymFilter, ADM
org.apache.solr.analysis.TrimFilter, M
org.apache.solr.analysis.WordDelimiterFilter, AM
org.apache.lucene.analysis.CachingTokenFilter, O
org.apache.lucene.analysis.ISOLatin1AccentFilter, M
org.apache.lucene.analysis.LengthFilter, D
org.apache.lucene.analysis.LowerCaseFilter, M
org.apache.lucene.analysis.PorterStemFilter, M
org.apache.lucene.analysis.StopFilter, D
org.apache.lucene.analysis.standard.StandardFilter, M
org.apache.lucene.analysis.br.BrazilianStemFilter, M
org.apache.lucene.analysis.cn.ChineseFilter, D
org.apache.lucene.analysis.de.GermanStemFilter, M
org.apache.lucene.analysis.el.GreekLowerCaseFilter, M
org.apache.lucene.analysis.fr.ElisionFilter, M
org.apache.lucene.analysis.fr.FrenchStemFilter, M
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter, AM
org.apache.lucene.analysis.ngram.NGramTokenFilter, AM
org.apache.lucene.analysis.nl.DutchStemFilter, M
org.apache.lucene.analysis.ru.RussianLowerCaseFilter, M
org.apache.lucene.analysis.ru.RussianStemFilter, M
org.apache.lucene.analysis.th.ThaiWordFilter, AM
org.apache.lucene.analysis.snowball.SnowballFilter, M

Some characteristics of Add (A), Delete (D), Modify (M), Observe (O):

Add: new Token() and buffer of Tokens to consider before addressing 
input.next()
Delete: loop ignoring tokens based on some criteria
Modify: new Token(), or use of Token set methods
Observe: rare (CachingTokenFilter)

The categories of TokenFilters that are affected by Payloads are add and 
modify.  By default, TokenFilters which only delete or observe return 
the Token fed through intact, hence the Payload will remain intact.
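
To make the distinction concrete, here is a small sketch in plain Java.  It is not the Lucene API: MutableTok and FilterCategorySketch are hypothetical names made up for this example, and lists stand in for a TokenStream.  It contrasts a delete-style filter, a modify-style filter that allocates a new token, and a modify-style filter that changes the token in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical mutable token; this is NOT Lucene's Token class.
final class MutableTok {
    String term;
    byte[] payload; // null means "no payload"

    MutableTok(String term, byte[] payload) {
        this.term = term;
        this.payload = payload;
    }
}

final class FilterCategorySketch {
    // Delete: tokens shorter than min are skipped; survivors are passed
    // through as the same objects, so their payloads are untouched.
    static List<MutableTok> lengthFilter(List<MutableTok> in, int min) {
        List<MutableTok> out = new ArrayList<>();
        for (MutableTok t : in) {
            if (t.term.length() >= min) {
                out.add(t);
            }
        }
        return out;
    }

    // Modify by allocating a new token: the payload is lost unless the
    // filter copies it explicitly (it does not here, to show the problem).
    static List<MutableTok> lowerCaseWithNewToken(List<MutableTok> in) {
        List<MutableTok> out = new ArrayList<>();
        for (MutableTok t : in) {
            out.add(new MutableTok(t.term.toLowerCase(Locale.ROOT), null));
        }
        return out;
    }

    // Modify in place: only the term changes, so the payload stays attached.
    static List<MutableTok> lowerCaseInPlace(List<MutableTok> in) {
        for (MutableTok t : in) {
            t.term = t.term.toLowerCase(Locale.ROOT);
        }
        return in;
    }
}
```

In this toy model only the variant that builds a new token drops the payload, which matches the observation that filters which avoid new Token() are less likely to eat Payloads.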


Maybe the Lucene community has thought about this problem?  I noticed 
that the org.apache.lucene.analysis TokenFilters in the modify category 
(there are none in the add category) refrain from using new Token().  
That led me to the comment in the JavaDocs:


*NOTE:* As of 2.3, Token stores the term text internally as a 
malleable char[] termBuffer instead of String termText. The indexing 
code and core tokenizers have been changed to re-use a single Token 
instance, changing its buffer and other fields in-place as the Token 
is processed. This provides substantially better indexing performance 
as it saves the GC cost of new'ing a Token and String for every term. 
The APIs that accept String termText are still available but a warning 
about the associated performance cost has been added (below). The 
|termText()| 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#termText%28%29 
method has been deprecated.


Tokenizers and filters should try to re-use a Token instance when 
possible for best performance, by implementing the 
|TokenStream.next(Token)|
