Hi,

I was wondering whether the highlighting problems with
HTMLStripWhitespaceTokenizerFactory (see
http://issues.apache.org/jira/browse/SOLR-42) could be resolved in
the following simple way.

The HTMLStripWhitespaceTokenizerFactory basically passes the input
through an HTMLStripReader, which removes the HTML, and then hands the
result to the WhitespaceTokenizer.  Because the markup is dropped, the
token offsets refer to the stripped text and no longer line up with the
original document.  If the HTMLStripReader simply replaced the HTML with
spaces (the same length as the removed HTML part), the positions for the
highlighter would be correct.  Most of the Tokenizers would be happy with
this solution (except maybe the KeywordTokenizer, which would keep the
padding spaces as part of its single token).
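
To make the idea concrete, here is a rough sketch in Java.  It is not the
actual HTMLStripReader code: the class name OffsetPreservingStrip and the
method blankOutTags are made up for illustration, and the regular
expression only covers simple tags (no comments, scripts, entities, or
'>' inside attribute values).  It only shows that overwriting the markup
with spaces keeps every remaining character at its original offset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: blanks out simple HTML tags
// while keeping the overall length of the text unchanged.
final class OffsetPreservingStrip {

    // Simplified tag pattern (assumption): '<', anything but '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String blankOutTags(String html) {
        StringBuilder out = new StringBuilder(html);
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // Overwrite each character of the tag with a space instead of
            // deleting it, so later characters keep their offsets.
            for (int i = m.start(); i < m.end(); i++) {
                out.setCharAt(i, ' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<b>Solr</b> rocks";
        String padded = blankOutTags(html);
        // Show the padded text and check that "Solr" did not move.
        System.out.println("[" + padded + "]");
        System.out.println(padded.indexOf("Solr") == html.indexOf("Solr"));
    }
}
```

A whitespace-based tokenizer run over the padded text would then report
start and end offsets that can be applied directly to the original HTML.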

mirko
