[ http://issues.apache.org/jira/browse/SOLR-11?page=all ]

Hoss Man updated SOLR-11:
-------------------------

    Attachment: SOLR-11-BufferedTokenStream-RemoveDuplicatesTokenFilter.patch

Back in this email...

http://www.nabble.com/BufferingTokenStream-and-RemoveDups-p4320716.html

...yonik posted an off-the-cuff solution to this problem that also included a 
nice reusable "BufferedTokenStream"

I've cleaned it up, added some tests, and fixed the bugs the tests surfaced 
(mainly the infinite loops, and the fact that the dup detection ignored every 
token with a non-zero position gap that was followed by a 0 position gap)

I'll commit this in the next few days unless anyone has any objections/comments 
regarding ways to improve it.  The RemoveDuplicatesTokenFilter.process method 
is much less elegant than Yonik's original version, but that's the only way I 
could get it to work.  I'd welcome any suggestions on regaining the elegance 
without breaking the tests.



> DeDupTokenFilter{Factory}
> -------------------------
>
>          Key: SOLR-11
>          URL: http://issues.apache.org/jira/browse/SOLR-11
>      Project: Solr
>         Type: Wish

>   Components: search
>     Reporter: Hoss Man
>  Attachments: SOLR-11-BufferedTokenStream-RemoveDuplicatesTokenFilter.patch, 
> solr.analysis.RemoveDuplicateTokensFilter.java, 
> solr.analysys.RemomveDuplicateTokensFilter.linkedhashmap.java
>
> I recently noticed a situation in which my Query analyzer was producing the 
> same Token more than once, resulting in it getting two equally boosted 
> clauses in the resulting query.  In my specific case, I was using the same 
> synonym file for multiple fields (some stemmed, some not) and two synonyms for 
> a word stemmed to the same root, which meant that particular word was worth 
> twice as much as any of the other variations of the synonym -- but I can imagine 
> other situations where this might come up, both at index time and at query 
> time, particularly when using SynonymFilter in combination with the 
> WordDelimiter filter.
> It occurred to me that a DeDupFilter would be handy.  In its simplest form it 
> would drop any Token it gets where the startOffset, endOffset, termText, and 
> type are all identical to the previous token and the positionIncrement is 0.  
> A more robust implementation might support init options indicating that only 
> certain combinations of those things should be used to determine equality 
> (ie: just termText, just termText and positionIncrement=0, etc...) but in 
> this case, an option might also be necessary to determine which of the Tokens 
> should be propagated (the first or the last)
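For anyone following along, the "simplest form" of the dedup rule described above can be sketched in plain Java. This is a hypothetical illustration only -- the `Tok` class and `dedup` method below are stand-ins I made up for this sketch; the actual patch works against Lucene's TokenStream/TokenFilter API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class DedupSketch {
    // Minimal stand-in for a Lucene Token: term text, offsets, type,
    // and position increment.  (Hypothetical class, not the Lucene API.)
    static final class Tok {
        final String term, type;
        final int start, end, posIncr;
        Tok(String term, int start, int end, String type, int posIncr) {
            this.term = term; this.start = start; this.end = end;
            this.type = type; this.posIncr = posIncr;
        }
    }

    // Drop any token whose positionIncrement is 0 and whose termText,
    // startOffset, endOffset, and type all match the previous token.
    static List<Tok> dedup(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok prev = null;
        for (Tok t : in) {
            boolean dup = prev != null
                && t.posIncr == 0
                && t.term.equals(prev.term)
                && t.start == prev.start
                && t.end == prev.end
                && Objects.equals(t.type, prev.type);
            if (!dup) out.add(t);
            prev = t;  // compare each token against the last one seen
        }
        return out;
    }
}
```

A stemmed synonym collision like the one described above would then look like two tokens with identical attributes and a position increment of 0 on the second, of which only the first survives.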

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
