Christian Moen created LUCENE-6216:
--------------------------------------

             Summary: Make it easier to modify Japanese token attributes downstream
                 Key: LUCENE-6216
                 URL: https://issues.apache.org/jira/browse/LUCENE-6216
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Christian Moen
            Priority: Minor


Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
{{BaseFormAttribute}}, etc. get their values from a 
{{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  This 
makes it cumbersome to change these token attributes later on in the analysis 
chain since {{Token}} instances are difficult to instantiate (they are 
effectively read-only objects).
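
To make the problem concrete, here is a minimal sketch of the token-backed pattern. All classes below ({{SketchDictionary}}, {{SketchToken}}, {{TokenBackedPartOfSpeech}}, {{TokenBackedDemo}}) are simplified stand-ins invented for illustration, not the actual Kuromoji classes, and the part-of-speech strings are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a dictionary; not the real Kuromoji Dictionary interface.
class SketchDictionary {
    private final Map<Integer, String> partOfSpeech = new HashMap<>();

    void put(int wordId, String pos) { partOfSpeech.put(wordId, pos); }

    String getPartOfSpeech(int wordId) { return partOfSpeech.get(wordId); }
}

// Hypothetical stand-in for org.apache.lucene.analysis.ja.Token: it holds only a
// word id and a dictionary reference, and resolves values on demand.
class SketchToken {
    private final int wordId;
    private final SketchDictionary dictionary;

    SketchToken(int wordId, SketchDictionary dictionary) {
        this.wordId = wordId;
        this.dictionary = dictionary;
    }

    String getPartOfSpeech() { return dictionary.getPartOfSpeech(wordId); }
}

// Token-backed attribute: the only way to change its value is setToken().
class TokenBackedPartOfSpeech {
    private SketchToken token;

    void setToken(SketchToken token) { this.token = token; }

    String getPartOfSpeech() { return token == null ? null : token.getPartOfSpeech(); }
}

class TokenBackedDemo {
    // With no direct setter, a downstream filter has to fabricate a dictionary
    // entry and a token just to carry a single string.
    static String overridePartOfSpeech(TokenBackedPartOfSpeech attribute, String newValue) {
        SketchDictionary scratch = new SketchDictionary();
        scratch.put(0, newValue);
        attribute.setToken(new SketchToken(0, scratch));
        return attribute.getPartOfSpeech();
    }
}
```

In this sketch the workaround is only a few lines, but the real {{Token}} carries more state than this stand-in, which is what makes instantiating one downstream awkward.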

I ran into this issue in LUCENE-3922 (JapaneseNumberFilter), where it would 
be appropriate to update token attributes to also reflect Japanese number 
normalization.

I think it might be more practical to allow setting a specific value for these 
token attributes directly rather than through a {{Token}}, since this makes the 
APIs simpler, makes it easier to change attributes downstream, and makes it 
easier to support additional dictionaries.
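
As a rough sketch of what direct setting could look like: the class {{DirectPartOfSpeech}}, its {{setPartOfSpeech(String)}} setter, and the {{NumberNormalizationSketch}} helper are assumptions made up for this example, not an existing or agreed API, and the tag string is a placeholder.

```java
// Hypothetical value-holding attribute; the setter name is an assumption.
class DirectPartOfSpeech {
    private String partOfSpeech;

    void setPartOfSpeech(String partOfSpeech) { this.partOfSpeech = partOfSpeech; }

    String getPartOfSpeech() { return partOfSpeech; }
}

class NumberNormalizationSketch {
    // Illustrative downstream step: after a term has been rewritten to a
    // normalized number, relabel it with a single setter call.
    static void relabelAsNumber(DirectPartOfSpeech attribute) {
        attribute.setPartOfSpeech("noun-numeral");
    }
}
```

With this shape, a tokenizer backed by any dictionary could populate the attribute with plain strings, and a downstream filter would not need a {{Token}} at all.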

The one drawback I can think of with this approach is a performance hit, since 
we would lose the inherent lazy retrieval of these token attributes from the 
{{Token}} object (and the underlying dictionary/buffer).
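
The toy comparison below illustrates where the extra work could come from. It only counts dictionary lookups and says nothing about actual timings; {{CountingLookup}} and {{LazyVsEagerSketch}} are hypothetical names invented for this sketch.

```java
import java.util.function.Supplier;

// Hypothetical dictionary stand-in that counts how often it is consulted.
class CountingLookup {
    int calls = 0;

    String lookup(int wordId) {
        calls++;
        return "pos-" + wordId;
    }
}

class LazyVsEagerSketch {
    // Eager: every token's value is resolved up front, whether or not anyone reads it.
    static int eagerLookups(int tokenCount) {
        CountingLookup dictionary = new CountingLookup();
        String[] values = new String[tokenCount];
        for (int i = 0; i < tokenCount; i++) {
            values[i] = dictionary.lookup(i);
        }
        return dictionary.calls;
    }

    // Lazy: a value is resolved only when a consumer actually asks for it.
    static int lazyLookups(int tokenCount, int tokensRead) {
        CountingLookup dictionary = new CountingLookup();
        for (int i = 0; i < tokenCount; i++) {
            final int wordId = i;
            Supplier<String> value = () -> dictionary.lookup(wordId);
            if (i < tokensRead) {
                value.get();
            }
        }
        return dictionary.calls;
    }
}
```

For example, {{eagerLookups(1000)}} performs 1000 lookups, while {{lazyLookups(1000, 10)}} performs only 10. Whether that difference matters in a real analysis chain is exactly what the testing would need to show.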

I'd like to do some testing to better understand the performance impact of this 
change. Happy to hear your thoughts on this.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
