DeDupTokenFilter{Factory}
-------------------------
Key: SOLR-11
URL: http://issues.apache.org/jira/browse/SOLR-11
Project: Solr
Type: Wish
Components: search
Reporter: Hoss Man
I recently noticed a situation in which my Query analyzer was producing the
same Token more than once, resulting in it getting two equally boosted clauses
in the resulting query. In my specific case, I was using the same synonym file
for multiple fields (some stemmed, some not) and two synonyms for a word stemmed
to the same root, which meant that particular word was worth twice as much as
any of the other variations of the synonym -- but I can imagine other situations
where this might come up, both at index time and at query time, particularly
when using SynonymFilter in combination with the WordDelimiter filter.
It occurred to me that a DeDupFilter would be handy. In its simplest form it
would drop any Token it gets where the startOffset, endOffset, termText, and
type are all identical to the previous token and the positionIncrement is 0. A
more robust implementation might support init options indicating that only
certain combinations of those things should be used to determine equality
(i.e. just termText, just termText and positionIncrement=0, etc.), but in that
case an option might also be necessary to determine which of the Tokens should
be propagated (the first or the last).
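
The simplest form described above could be sketched roughly as follows. This is
only an illustration, not a proposed implementation: Tok is a hypothetical
stand-in for an analysis Token (it is not the Lucene class), DeDupSketch and
dedup are made-up names, and the sketch works on a plain list rather than a
real TokenStream. It always keeps the first of a run of duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeDupSketch {

    /** Hypothetical stand-in for an analysis Token; NOT the Lucene Token class. */
    static final class Tok {
        final String termText;
        final String type;
        final int startOffset;
        final int endOffset;
        final int positionIncrement;

        Tok(String termText, String type, int startOffset, int endOffset,
            int positionIncrement) {
            this.termText = termText;
            this.type = type;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        /** True if offsets, term text and type all match the other token. */
        boolean sameAs(Tok other) {
            return startOffset == other.startOffset
                && endOffset == other.endOffset
                && termText.equals(other.termText)
                && type.equals(other.type);
        }
    }

    /**
     * Drops a token when its positionIncrement is 0 and it is identical
     * (offsets, term text, type) to the token immediately before it.
     */
    static List<Tok> dedup(List<Tok> tokens) {
        List<Tok> out = new ArrayList<>();
        Tok prev = null;
        for (Tok t : tokens) {
            boolean duplicate = prev != null
                && t.positionIncrement == 0
                && t.sameAs(prev);
            if (!duplicate) {
                out.add(t);
            }
            prev = t; // always compare against the immediately preceding token
        }
        return out;
    }

    public static void main(String[] args) {
        // Two synonyms that stemmed to the same root, stacked at one position.
        List<Tok> tokens = Arrays.asList(
            new Tok("run", "word", 0, 7, 1),
            new Tok("run", "word", 0, 7, 0),
            new Tok("jog", "word", 0, 7, 0));
        for (Tok t : dedup(tokens)) {
            System.out.println(t.termText);
        }
    }
}
```

Because only the immediately preceding token is compared, identical tokens
separated by a different token at the same position are not caught here; the
more configurable variant described above would have to decide how to handle
that case.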