[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556748#action_12556748 ]
Hoss Man commented on SOLR-42:
------------------------------

Hmmm.... I was assuming this would be an option on both the HTMLStripReader and the Tokenizers that use it (the tokenizers taking the option only to pass it on to the Reader), but I see what you mean ... once the Tokenizer knows the character positions of the "words" coming out of the text, it can then strip out those filler characters. (Hmmm... strip the characters that are a placeholder for when other characters were already stripped ... why do I feel like we're going to go to hell for this?)

If there were characters that we were *certain* would never appear in any unicode string, we could do it all under the covers by picking one of them ... but the safest thing to do would still be to have it as an option (but with a sensible default from the "private use" range instead of an empty string).

... So the HTMLStripReader would have a constructor that looks like...

{code}
/**
 * @param entityFiller character to replace gaps made when entities are
 *        collapsed to real characters so that character positions still
 *        line up, may be null if no filler should be used
 * @param tagFiller character to replace gaps made when tags are removed
 *        so that character positions still line up, may be null if no
 *        filler should be used
 */
public HTMLStripReader(Reader input, Character entityFiller, Character tagFiller) { ... }
{code}

and the Tokenizers could look like...

{code}
public class HTMLStripStandardTokenizerFactory extends BaseTokenizerFactory {
  Pattern fillerPattern;
  Character entityFiller, tagFiller;

  public void init(...) {
    entityFiller = getInitParam(...);
    tagFiller = getInitParam(...);
    // only build the pattern when the stripFiller option is set
    fillerPattern = getInitParam(stripFiller) ? makePattern(entityFiller, tagFiller) : null;
  }

  public TokenStream create(Reader input) {
    TokenStream s = new StandardTokenizer(new HTMLStripReader(input, entityFiller, tagFiller));
    if (null != fillerPattern) {
      // remove every filler character from each token's text
      s = new PatternReplaceFilter(s, fillerPattern, "", true);
    }
    return s;
  }
}
{code}

that should totally work right?

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup causes problems with highlighting
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has
> the <em> tags in the wrong place - 22 characters to the left of where they
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this? The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
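
For anyone following the filler idea from the comment above, here is a minimal stand-alone sketch in plain Java of why padding keeps the offsets lined up. It is not the Solr code: the names FillerOffsetDemo, padTags, tokenize and Token are hypothetical, U+E000 is only an assumed default picked from the "private use" range, and it handles tags only (no entities) with a naive whitespace split. Each tag character is replaced by the filler so the text keeps its original length, tokens are cut on whitespace, and the filler is then removed from the token text while the start/end offsets are left alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it does not use any Solr or Lucene types.
class FillerOffsetDemo {
    // Assumed default filler: first code point of the Unicode private use area.
    static final char FILLER = '\uE000';

    // A token's text plus its start (inclusive) and end (exclusive) offsets in the ORIGINAL input.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Replace every character of each <...> tag with the filler,
    // so the result has exactly the same length as the input.
    static String padTags(String html, char filler) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            out.append(inTag ? filler : c);
            if (c == '>') {
                inTag = false;
            }
        }
        return out.toString();
    }

    // Split on whitespace, then strip the filler from each token's text.
    // Offsets are taken before stripping, so they still point into the original input.
    static List<Token> tokenize(String padded, char filler) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        int n = padded.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(padded.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(padded.charAt(i))) {
                i++;
            }
            if (i > start) {
                String text = padded.substring(start, i).replace(String.valueOf(filler), "");
                if (!text.isEmpty()) { // skip tokens that were nothing but markup
                    tokens.add(new Token(text, start, i));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
                + "fabrics in a polyorogenic terrane of NW Iberia";
        for (Token t : tokenize(padTags(title, FILLER), FILLER)) {
            // The substring of the original title at [start, end) is what a highlighter would wrap.
            System.out.println(t.text + " [" + t.start + "," + t.end + ") -> "
                    + title.substring(t.start, t.end));
        }
    }
}
```

With the title from the issue, the token "fabrics" comes out with offsets that index straight into the original markup, which is what the highlighter needs. The first token still spans the surrounding SUP tags, so a real implementation would probably also want to trim leading and trailing filler from the offsets.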