Hi,
Kindly help me understand how HTMLStripCharFilter works.
I have the following analysis chain:
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS
            | WordDelimiterFilter.CATENATE_NUMBERS
            | WordDelimiterFilter.CATENATE_ALL
            | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE
            | WordDelimiterFilter.PRESERVE_ORIGINAL;

    @Override
    protected Reader initReader(String field, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String arg0) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream wordDMTStrem = new WordDelimiterFilter(source, flags, null);
        TokenStream rdtStream = new RemoveDuplicatesTokenFilter(wordDMTStrem);
        return new TokenStreamComponents(source, rdtStream);
    }
With the above analysis chain, the input *teRm<sub>3</sub>* produces the
following tokens:
    Text     Position Increment   Position Length   Offset (start, end)
    teRm3    1                    1                 0, 16
    Rm3      1                    1                 0, 16
    te       0                    1                 0, 16
    teRm3    0                    1                 0, 16
In the above table, teRm3 occurs twice, and the second occurrence is not
removed by RemoveDuplicatesTokenFilter.
In contrast, the plain input *teRm3* is tokenized by the same analysis chain as follows:
    Text     Position Increment   Position Length   Offset (start, end)
    teRm3    1                    1                 0, 5
    te       0                    1                 0, 2
    Rm3      1                    1                 2, 5
In this table, the duplicate *teRm3* has been removed by
RemoveDuplicatesTokenFilter, so it appears only once.
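To make the two tables easier to compare, here is a minimal sketch in plain Java (no Lucene classes) that converts the position increments into absolute positions. The class and method names are hypothetical and the code only restates the table rows. It is relevant under the assumption, taken from the RemoveDuplicatesTokenFilter Javadoc, that the filter drops a token only when the same text has already been seen at the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only restates the tables above.
class PositionCheck {

    // Convert per-token position increments into "text@absolutePosition" labels.
    static List<String> absolutePositions(String[] texts, int[] posIncs) {
        List<String> placed = new ArrayList<>();
        int position = 0;
        for (int i = 0; i < texts.length; i++) {
            // An increment of 0 stacks the token on the previous token's position.
            position += posIncs[i];
            placed.add(texts[i] + "@" + position);
        }
        return placed;
    }

    public static void main(String[] args) {
        // Rows of the first table (input with the <sub> markup).
        System.out.println(absolutePositions(
                new String[] {"teRm3", "Rm3", "te", "teRm3"},
                new int[] {1, 1, 0, 0}));
        // prints [teRm3@1, Rm3@2, te@2, teRm3@2]

        // Rows of the second table (plain input).
        System.out.println(absolutePositions(
                new String[] {"teRm3", "te", "Rm3"},
                new int[] {1, 0, 1}));
        // prints [teRm3@1, te@1, Rm3@2]
    }
}
```

Read this way, the two teRm3 tokens in the first table sit at positions 1 and 2 rather than at the same position.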
Please share your comments on this difference in analysis behavior.
Thanks,
Modassar