DeDupTokenFilter{Factory}
-------------------------

         Key: SOLR-11
         URL: http://issues.apache.org/jira/browse/SOLR-11
     Project: Solr
        Type: Wish

  Components: search  
    Reporter: Hoss Man


I recently noticed a situation in which my Query analyzer was producing the 
same Token more than once, resulting in it getting two equally boosted clauses 
in the resulting query.  In my specific case, I was using the same synonym file 
for multiple fields (some stemmed, some not) and two synonyms for a word stemmed 
to the same root, which meant that particular word was worth twice as much as 
any of the other variations of the synonym -- but I can imagine other situations 
where this might come up, both at index time and at query time, particularly 
when using SynonymFilter in combination with the WordDelimiter filter.

It occurred to me that a DeDupFilter would be handy.  In its simplest form it 
would drop any Token it gets where the startOffset, endOffset, termText, and 
type are all identical to the previous token and the positionIncrement is 0.  A 
more robust implementation might support init options indicating that only 
certain combinations of those things should be used to determine equality (i.e., 
just termText, just termText and positionIncrement=0, etc...), but in this case 
an option might also be necessary to determine which of the Tokens should be 
propagated (the first or the last).
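
As a rough illustration of the simplest form described above, here is a minimal 
sketch in plain Java.  It deliberately does not use the Lucene TokenFilter API: 
the Tok class is a hypothetical stand-in for a Token, and DeDupSketch and dedup 
are made-up names for this example only.  It always keeps the first of a run of 
duplicates, which is just one of the two choices mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

public class DeDupSketch {

    /** Hypothetical stand-in for a Token; NOT the real Lucene class. */
    static final class Tok {
        final String termText;
        final String type;
        final int startOffset;
        final int endOffset;
        final int positionIncrement;

        Tok(String termText, String type, int startOffset, int endOffset,
            int positionIncrement) {
            this.termText = termText;
            this.type = type;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        /** True if offsets, term text, and type all match the other token. */
        boolean sameAs(Tok other) {
            return startOffset == other.startOffset
                && endOffset == other.endOffset
                && termText.equals(other.termText)
                && type.equals(other.type);
        }
    }

    /**
     * Drops any token that has positionIncrement 0 and matches the token
     * immediately before it in the input.  The first token of a run is kept.
     */
    static List<Tok> dedup(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok prev = null;
        for (Tok t : in) {
            boolean duplicate = prev != null
                && t.positionIncrement == 0
                && t.sameAs(prev);
            if (!duplicate) {
                out.add(t);
            }
            prev = t;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = new ArrayList<>();
        in.add(new Tok("run", "word", 0, 7, 1)); // original position
        in.add(new Tok("run", "word", 0, 7, 0)); // same token again: dropped
        in.add(new Tok("jog", "word", 0, 7, 0)); // different term text: kept
        for (Tok t : dedup(in)) {
            System.out.println(t.termText);
        }
    }
}
```

Because only the immediately preceding token is compared, identical tokens at 
the same position that are separated by a different token would both survive in 
this sketch; catching those would mean remembering every token seen at the 
current position.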

