[
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Harris updated SOLR-42:
-----------------------------
Attachment: TokenPrinter.java
            HtmlStripReaderTestXmlProcessing.patch
The committed HtmlStripReader doesn't seem to handle offsets correctly for XML
processing instructions such as this:
<?xml version="1.0" encoding="UTF-8" ?>
I'm attaching two files:
HtmlStripReaderTestXmlProcessing.patch adds an HtmlStripReader test case to
catch the problem. (The test currently fails.)
TokenPrinter.java can help make it a little clearer what the problem actually
is. Here is the output if I run it against the analysis code in trunk.
Note that the offsets are what one would expect, except in the XML processing
instruction case, where the start offset is one too small: it points at the
'>' that closes the preceding tag rather than at the first character of the
token:
-------------------------------------
String to test: <uniqueKey>id</uniqueKey>
Token info:
token 'id'
startOffset: 11
char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- Unless this field is marked with required="false", it will
be a required field -->
<uniqueKey>id</uniqueKey>
Token info:
token 'id'
startOffset: 99
char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- And now: two elements --> <element1>one</element1>
<element2>two</element2>
Token info:
token 'one'
startOffset: 41
char at startOffset, and next few: 'one</'
token 'two'
startOffset: 68
char at startOffset, and next few: 'two</'
-------------------------------------
String to test: <?xml version="1.0" encoding="UTF-8" ?><uniqueKey>id</uniqueKey>
Token info:
token 'id'
startOffset: 49
char at startOffset, and next few: '>id</'
-------------------------------------
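The check that this output makes visible can be sketched without any Lucene
dependency: a reported start offset is right only if the original string
contains the token text at that position. The class and method names below
are hypothetical and are not taken from the attached TokenPrinter.java; the
inputs and offsets are the two single-line uniqueKey cases from the output
above.

```java
class OffsetCheck {
    /** Returns true if the token text appears in the original input at startOffset. */
    static boolean pointsAtToken(String original, String token, int startOffset) {
        return original.startsWith(token, startOffset);
    }

    /** Returns up to count characters of the original input, starting at startOffset. */
    static String peek(String original, int startOffset, int count) {
        int end = Math.min(original.length(), startOffset + count);
        return original.substring(startOffset, end);
    }

    public static void main(String[] args) {
        String plain = "<uniqueKey>id</uniqueKey>";
        String withPi =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><uniqueKey>id</uniqueKey>";

        // Offset reported for the plain case: lands on the token.
        System.out.println(pointsAtToken(plain, "id", 11));   // true
        // Offset reported for the processing-instruction case: one too small.
        System.out.println(pointsAtToken(withPi, "id", 49));  // false
        System.out.println(peek(withPi, 49, 5));              // >id</
        // The offset one would expect instead.
        System.out.println(pointsAtToken(withPi, "id", 50));  // true
    }
}
```

The processing instruction is 39 characters long and <uniqueKey> adds 11, so
the token should start at offset 50 rather than the reported 49.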
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
> Key: SOLR-42
> URL: https://issues.apache.org/jira/browse/SOLR-42
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Reporter: Andrew May
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java,
> HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch,
> SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup causes problems with highlighting
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has
> the <em> tags in the wrong place - 22 characters to the left of where they
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this? The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.
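
The 22-character shift in the example title, and the kind of bookkeeping a
fix would need, can be sketched as follows. This is not the committed
HTMLStripReader: the class and method names are hypothetical, and markup is
naively assumed to be anything between '<' and '>' and is simply dropped.
While stripping, the sketch records the original index of every character it
keeps, so an offset computed on the stripped text can be translated back.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetMappingStripper {
    private final StringBuilder stripped = new StringBuilder();
    // originalIndex.get(i) is the position in the input of stripped char i.
    private final List<Integer> originalIndex = new ArrayList<>();

    OffsetMappingStripper(String input) {
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;                // naive: every '<' opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;               // tag closed, nothing is emitted
            } else if (!inTag) {
                stripped.append(c);
                originalIndex.add(i);
            }
        }
    }

    /** The text a downstream tokenizer would see. */
    String text() {
        return stripped.toString();
    }

    /** Maps an offset in the stripped text back to the original input. */
    int toOriginalOffset(int strippedOffset) {
        return originalIndex.get(strippedOffset);
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
            + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
        OffsetMappingStripper s = new OffsetMappingStripper(title);

        int strippedStart = s.text().indexOf("fabrics");
        int originalStart = s.toOriginalOffset(strippedStart);

        // Uncorrected offsets are short by the total length of the tags.
        System.out.println(originalStart - strippedStart);              // 22
        // The corrected offset lands on the token in the original text.
        System.out.println(title.startsWith("fabrics", originalStart)); // true
    }
}
```

A per-character table is the simplest thing that works for a sketch; a real
reader would more likely store a correction only at the points where the
cumulative difference changes.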