Hi, I was wondering whether the highlighting problems with HTMLStripWhitespaceTokenizerFactory (see http://issues.apache.org/jira/browse/SOLR-42) could be resolved in the following simple way.
The HTMLStripWhitespaceTokenizerFactory basically passes the input through an HTMLStripReader, which removes the HTML, and then hands the result to the WhitespaceTokenizer. If the HTMLStripReader simply replaced the HTML with spaces (the same length as the removed HTML part), then the positions for the highlighter would be correct. Most of the Tokenizers would be happy with this solution (except maybe the KeywordTokenizer).
mirko
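
To illustrate the padding idea, here is a minimal sketch in plain Java. It is not the actual HTMLStripReader code: the class name PaddingHtmlStripper and the method stripKeepingOffsets are hypothetical, it works on a String rather than a Reader, and it naively treats anything between '<' and the next '>' as a tag (entities, comments and scripts are not handled). It only shows that replacing each tag with the same number of spaces keeps every remaining character at its original offset.

```java
// Hypothetical helper, for illustration only; not part of Solr or Lucene.
final class PaddingHtmlStripper {
    private PaddingHtmlStripper() {}

    /**
     * Replaces every "<...>" span with spaces of the same length,
     * so the output has the same length as the input and all
     * surviving characters keep their original offsets.
     */
    static String stripKeepingOffsets(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            // Only treat '<' as a tag start if a closing '>' follows.
            int close = (c == '<') ? html.indexOf('>', i + 1) : -1;
            if (close >= 0) {
                // Pad the whole tag, including '<' and '>', with spaces.
                for (int j = i; j <= close; j++) {
                    out.append(' ');
                }
                i = close + 1;
            } else {
                // Ordinary text (or an unclosed '<'): copy unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

For the input "<b>Solr</b> rocks" the result has the same length as the input, and "Solr" and "rocks" start at offsets 3 and 12 in both strings, which is what the highlighter needs. A whitespace-based tokenizer just sees extra whitespace, while something like the KeywordTokenizer would carry the padding into its single token.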
