[
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556104#action_12556104
]
Yonik Seeley commented on SOLR-42:
----------------------------------
Grant, I'm getting a test failure... did you forget to "svn add" some files?
<error message="src\test\test-files\htmlStripReaderTest.html (The system cannot find the path specified)" type="java.io.FileNotFoundException">java.io.FileNotFoundException: src\test\test-files\htmlStripReaderTest.html (The system cannot find the path specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at java.io.FileReader.<init>(FileReader.java:55)
at org.apache.solr.analysis.HTMLStripReaderTest.testHTML(HTMLStripReaderTest.java:65)
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
> Key: SOLR-42
> URL: https://issues.apache.org/jira/browse/SOLR-42
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Reporter: Andrew May
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup causes problems with highlighting
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has
> the <em> tags in the wrong place - 22 characters to the left of where they
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this? The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.