Hi, please help me understand how HTMLStripCharFilter works.
I have the following analysis chain:

    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
        | WordDelimiterFilter.CATENATE_WORDS
        | WordDelimiterFilter.CATENATE_NUMBERS
        | WordDelimiterFilter.CATENATE_ALL
        | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
        | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE
        | WordDelimiterFilter.PRESERVE_ORIGINAL;

    @Override
    protected Reader initReader(String field, Reader reader) {
      return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String arg0) {
      Tokenizer source = new WhitespaceTokenizer();
      TokenStream wordDMTStrem = new WordDelimiterFilter(source, flags, null);
      TokenStream rdtStream = new RemoveDuplicatesTokenFilter(wordDMTStrem);
      return new TokenStreamComponents(source, rdtStream);
    }

With the above analysis chain, *teRm<sub>3</sub>* is analyzed into the following tokens:

    Text     Position Increment   Position Length   Offset attribute
    teRm3    1                    1                 0, 16
    Rm3      1                    1                 0, 16
    te       0                    1                 0, 16
    teRm3    0                    1                 0, 16

In the above table teRm3 occurs twice, but the second occurrence is not removed by RemoveDuplicatesTokenFilter.

The plain input *teRm3*, on the other hand, is tokenized by the same analysis chain as follows:

    Text     Position Increment   Position Length   Offset attribute
    teRm3    1                    1                 0, 5
    te       0                    1                 0, 2
    Rm3      1                    1                 2, 5

Here the duplicate teRm3 was removed by RemoveDuplicatesTokenFilter, so it appears only once.

Please share your comments on this difference in analysis behavior.

Thanks,
Modassar