[jira] Commented: (SOLR-1657) convert the rest of solr to use the new tokenstream API

Robert Muir (JIRA) Wed, 06 Jan 2010 08:22:18 -0800

    [ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797161#action_12797161 ]


Robert Muir commented on SOLR-1657:
-----------------------------------

Hello, I am working on WordDelimiterFilter and I have a question: how do we 
want custom attributes to work here?

This affects performance of the filter under the new tokenstream API, as it 
will determine when/if we have to save/restore state.

Here are two alternatives:

Alternative #1 (most performant): custom attributes from the original term will 
only apply to words with no delimiters, or in the case of words with 
delimiters, only the 'original' token output with the 'preserveOriginal' 
option. This is the easiest to understand in my opinion, and would perform the
best. It's arguable that if you split a term into 10 subwords, applying these
attributes to all 10 subwords may no longer make sense.

Alternative #2 (least performant): custom attributes from the original term
will only apply to non-injected terms: this means if a word is split into 10
tokens, all 10 subword tokens, but not their concatenations, also have the
attributes derived from the original term. If preserveOriginal is on, then the
original token has the attributes as well.

Alternative #3: ??? your ideas?
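
To make the difference between Alternative #1 and Alternative #2 concrete,
here is a minimal sketch in plain Java. It is a toy model, not the Lucene/Solr
API: the class WdfAttributeSketch, the Token type, the Policy enum and the
split() method are all hypothetical names made up for illustration. It assumes
a single custom attribute per term, treats only non-alphanumeric characters as
delimiters, and always emits one concatenated (injected) token.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types exist in Lucene or Solr.
final class WdfAttributeSketch {

    /** Which output tokens inherit the custom attribute of the input term. */
    enum Policy {
        ORIGINAL_ONLY,  // Alternative #1: only unsplit words / the preserved original
        ALL_SUBWORDS    // Alternative #2: every subword, but not concatenations
    }

    /** A term plus one optional custom attribute value (null when absent). */
    static final class Token {
        final String term;
        final String custom;

        Token(String term, String custom) {
            this.term = term;
            this.custom = custom;
        }

        @Override
        public String toString() {
            return custom == null ? term : term + "[" + custom + "]";
        }
    }

    static List<Token> split(String term, String custom, Policy policy,
                             boolean preserveOriginal) {
        // Assumption: any run of non-alphanumeric characters is a delimiter.
        List<String> subwords = new ArrayList<>();
        for (String part : term.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                subwords.add(part);
            }
        }

        List<Token> out = new ArrayList<>();
        boolean unsplit = subwords.size() == 1 && subwords.get(0).equals(term);
        if (unsplit) {
            // A word with no delimiters keeps its attribute under both policies.
            out.add(new Token(term, custom));
            return out;
        }
        if (preserveOriginal) {
            // The preserved original keeps its attribute under both policies.
            out.add(new Token(term, custom));
        }
        for (String subword : subwords) {
            // Only Alternative #2 copies the attribute onto each subword,
            // which is where the extra per-token work would come from.
            out.add(new Token(subword, policy == Policy.ALL_SUBWORDS ? custom : null));
        }
        if (subwords.size() > 1) {
            // The concatenation is an injected token: no attribute in either case.
            out.add(new Token(String.join("", subwords), null));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("Wi-Fi", "brand", Policy.ORIGINAL_ONLY, true));
        System.out.println(split("Wi-Fi", "brand", Policy.ALL_SUBWORDS, true));
    }
}
```

In this model the only difference between the two policies is the line that
builds each subword token. In the real filter that copy would presumably mean
a state restore per subword, which is the cost in question.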

In my opinion, we should shoot for maximum performance. I view this filter as
somewhat like a tokenizer, so custom attributes in general would be applied
after WDF: trying to apply them before WDF and expecting them to make sense
afterwards will be confusing. But it does not really matter.


> convert the rest of solr to use the new tokenstream API
> -------------------------------------------------------
>
>                 Key: SOLR-1657
>                 URL: https://issues.apache.org/jira/browse/SOLR-1657
>             Project: Solr
>          Issue Type: Task
>            Reporter: Robert Muir
>         Attachments: SOLR-1657.patch, SOLR-1657.patch
>
>
> org.apache.solr.analysis:
> BufferedTokenStream
>  -> -CommonGramsFilter-
>  -> -CommonGramsQueryFilter-
>  -> -RemoveDuplicatesTokenFilter-
> -CapitalizationFilterFactory-
> -HyphenatedWordsFilter-
> -LengthFilter (deprecated, remove)-
> SynonymFilter
> SynonymFilterFactory
> WordDelimiterFilter
> org.apache.solr.handler:
> AnalysisRequestHandler
> AnalysisRequestHandlerBase
> org.apache.solr.handler.component:
> QueryElevationComponent
> SpellCheckComponent
> org.apache.solr.highlight:
> DefaultSolrHighlighter
> org.apache.solr.search:
> FieldQParserPlugin
> org.apache.solr.spelling:
> SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
