[ http://issues.apache.org/jira/browse/SOLR-11?page=all ]
Richard "Trey" Hyde updated SOLR-11:
------------------------------------
Attachment: solr.analysys.RemomveDuplicateTokensFilter.linkedhashmap.java
Alternate implementation based on a LinkedHashMap instead of a LinkedList queue.
It should work better for larger data sets. It still ignores token position,
but I don't know enough to say whether that is important.
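
For readers without the attachment, here is a minimal sketch of the general idea, not the attached code. It assumes tokens are reduced to plain term strings, and the class and method names are made up for illustration. A LinkedHashMap gives constant-time average lookups while keeping first-seen order, which is roughly why it should scale better than scanning a list for every token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a token stream; NOT the attached filter
// and NOT the Lucene TokenFilter API.
final class LinkedHashMapDedupSketch {

    // Returns the terms with repeats dropped, keeping the first occurrence
    // of each term in its original order. Position is ignored entirely.
    static List<String> dedupe(List<String> terms) {
        Map<String, Boolean> seen = new LinkedHashMap<>();
        for (String term : terms) {
            // Hash lookup instead of a linear scan over earlier tokens.
            seen.putIfAbsent(term, Boolean.TRUE);
        }
        // Key iteration order is insertion order for a LinkedHashMap.
        return new ArrayList<>(seen.keySet());
    }
}
```

Because only the term text is compared, two identical terms at different positions collapse into one, which is the position caveat mentioned above.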
> DeDupTokenFilter{Factory}
> -------------------------
>
> Key: SOLR-11
> URL: http://issues.apache.org/jira/browse/SOLR-11
> Project: Solr
> Type: Wish
> Components: search
> Reporter: Hoss Man
> Attachments: solr.analysis.RemoveDuplicateTokensFilter.java,
> solr.analysys.RemomveDuplicateTokensFilter.linkedhashmap.java
>
> I recently noticed a situation in which my Query analyzer was producing the
> same Token more than once, resulting in it getting two equally boosted
> clauses in the resulting query. In my specific case, I was using the same
> synonym file for multiple fields (some stemmed, some not) and two synonyms for
> a word stemmed to the same root, which meant that particular word was worth
> twice as much as any of the other variations of the synonym. I can imagine
> other situations where this might come up, both at index time and at query
> time, particularly when using SynonymFilter in combination with the
> WordDelimiter filter.
> It occurred to me that a DeDupFilter would be handy. In its simplest form it
> would drop any Token it gets where the startOffset, endOffset, termText, and
> type are all identical to the previous token and the positionIncrement is 0.
> A more robust implementation might support init options indicating that only
> certain combinations of those things should be used to determine equality
> (i.e., just termText, just termText and positionIncrement=0, etc.), but in
> that case an option might also be necessary to determine which of the Tokens
> should be propagated (the first or the last).
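
The "simplest form" described in the issue could be sketched roughly as follows. This is an illustration under assumptions, not the attached filter: Tok is a hypothetical minimal token class whose field names follow the issue text, and it does not use the real Lucene Token or TokenFilter classes. A token is dropped only when it matches the previous token on all four fields and has a position increment of 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class AdjacentDedupSketch {

    // Hypothetical minimal token; field names follow the issue text, not a real API.
    static final class Tok {
        final String termText;
        final String type;
        final int startOffset;
        final int endOffset;
        final int positionIncrement;

        Tok(String termText, String type, int startOffset, int endOffset, int positionIncrement) {
            this.termText = termText;
            this.type = type;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        // Equality on the four fields named in the issue (position increment excluded).
        boolean sameAs(Tok other) {
            return startOffset == other.startOffset
                && endOffset == other.endOffset
                && Objects.equals(termText, other.termText)
                && Objects.equals(type, other.type);
        }
    }

    // Drops a token when it matches the previous input token and sits at the
    // same position (positionIncrement == 0); the first of a run is kept.
    static List<Tok> dedupe(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok prev = null;
        for (Tok t : in) {
            boolean dup = prev != null && t.positionIncrement == 0 && t.sameAs(prev);
            if (!dup) {
                out.add(t);
            }
            prev = t;
        }
        return out;
    }
}
```

This version always keeps the first token of a duplicate run. The configurable variant from the description would need an extra option for which fields to compare and whether the first or the last token survives.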