Generic RemoveDuplicatesTokenFilter

pravesh Mon, 12 Dec 2011 23:31:21 -0800

Hi All,

Currently, Solr's existing RemoveDuplicatesTokenFilter
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory
removes duplicate tokens only when they have the same text and are
logically at the same position.


In my case, if the same term appears several times in a row, I need to
remove the duplicates and keep only a single occurrence of the term
(even when the position increment is 1).

For example, if the input stream is:  /quick brown brown brown fox jumps
jumps over the little little lazy brown dog/
then the output should be:  /quick brown fox jumps over the little lazy
brown dog/

To achieve this, I implemented my own version of
/RemoveDuplicatesTokenFilter/ with an overridden /process()/ method:

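  protected Token process(Token t) throws IOException {
      // Look one token ahead without consuming it.
      Token nextTok = peek(1);
      if (t != null && nextTok != null) {
          // Same text (ignoring case) as the next token: drop this one,
          // so only the last token of each run is emitted.
          if (t.termText().equalsIgnoreCase(nextTok.termText())) {
              return null;
          }
      }
      return t;
  }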
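
To make the behaviour easier to check outside Solr, here is a minimal
sketch of the same look-ahead rule applied to a plain list of strings. It
is only an illustration: the class name ConsecutiveDedup and the method
collapse are hypothetical and are not part of Solr or Lucene, and a list
index stands in for peek(1).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConsecutiveDedup {
    // Hypothetical helper, not a Solr/Lucene API: keeps a term only when the
    // next term differs (ignoring case), mirroring the peek(1) check above.
    public static List<String> collapse(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            String current = terms.get(i);
            String next = (i + 1 < terms.size()) ? terms.get(i + 1) : null;
            if (next != null && current.equalsIgnoreCase(next)) {
                continue; // dropped; a later copy in the same run is kept
            }
            out.add(current);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
                ("quick brown brown brown fox jumps jumps "
                + "over the little little lazy brown dog").split(" "));
        // prints: quick brown fox jumps over the little lazy brown dog
        System.out.println(String.join(" ", collapse(in)));
    }
}
```

Note that the survivor of each run is its last token, not its first, so in
the real filter it would be the last token's offsets that are kept.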

The above implementation works as desired, and consecutive duplicates are
removed :)

Any advice or feedback on this implementation?

Regards
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Generic-RemoveDuplicatesTokenFilter-tp3581656p3581656.html
Sent from the Solr - User mailing list archive at Nabble.com.
