[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556104#action_12556104 ]
Yonik Seeley commented on SOLR-42:
----------------------------------

Grant, I'm getting a test failure... did you forget to "svn add" some files?

<error message="src\test\test-files\htmlStripReaderTest.html (The system cannot find the path specified)" type="java.io.FileNotFoundException">java.io.FileNotFoundException: src\test\test-files\htmlStripReaderTest.html (The system cannot find the path specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileReader.<init>(FileReader.java:55)
    at org.apache.solr.analysis.HTMLStripReaderTest.testHTML(HTMLStripReaderTest.java:65)

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has
> the <em> tags in the wrong place - 22 characters to the left of where they
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this? The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
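
The offset drift described in the quoted issue can be reproduced with a small standalone sketch. It does not use Solr's HTMLStripReader or WhitespaceTokenizer; the class `OffsetDriftDemo` and its methods `stripTags` and `tagLength` are hypothetical stand-ins, and the regex is only a rough approximation of tag removal. It shows that the position of "fabrics" in the stripped text, when applied to the original title, lands 22 characters too far to the left.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetDriftDemo {
    // Assumption: a crude regex stand-in for HTMLStripReader, not the real class.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Remove everything that looks like a tag.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Sum of the lengths of all tags in the input.
    static int tagLength(String html) {
        Matcher m = TAG.matcher(html);
        int total = 0;
        while (m.find()) {
            total += m.end() - m.start();
        }
        return total;
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
                + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
        String stripped = stripTags(title);

        int inOriginal = title.indexOf("fabrics");    // where the highlight belongs
        int inStripped = stripped.indexOf("fabrics"); // what a tokenizer run after stripping reports

        System.out.println("offset in original text: " + inOriginal);                // 63
        System.out.println("offset in stripped text: " + inStripped);                // 41
        System.out.println("drift:                   " + (inOriginal - inStripped)); // 22
        System.out.println("total tag length:        " + tagLength(title));          // 22
    }
}
```

Because every tag in this title precedes "fabrics", the drift equals the total tag length. A fix along the lines suggested in the issue would need the stripping step to track how many characters it has removed so far and add that count back to each token's offsets.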