[ 
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556729#action_12556729
 ] 

Hoss Man commented on SOLR-42:
------------------------------

I don't know much about Unicode, but there are *so* many special characters in 
Unicode, I just have to wonder if there is a special marker character that 
could be used instead of whitespace to "fill in the gaps" left when converting 
entities to real characters (or stripping tags).  ... something that isn't 
printable, and doesn't trigger any "boundary" logic (ie: not whitespace, 
punctuation, letter, digit, etc...)

 * NUL perhaps?  (can you legally embed null in a string in Java?)
 * does anyone understand the definition of a "nonspacing mark" ?
 * the "Invisible Separator" character?
 * a "private use" character?  (this actually seems like the most promising 
option)
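
To make the candidates above easier to compare, here is a minimal Java sketch 
that asks java.lang.Character how it classifies one representative of each. 
The class name FillCharProbe and the specific code points chosen (U+0301 as an 
example nonspacing mark, U+E000 as an example private use character) are 
assumptions for illustration. It only reports the JDK's view of each 
character; how a particular tokenizer treats them is a separate question.

```java
public class FillCharProbe {
    // One representative per candidate; the code points are illustrative picks.
    static final char NUL = '\u0000';
    static final char NONSPACING_MARK = '\u0301';     // COMBINING ACUTE ACCENT
    static final char INVISIBLE_SEPARATOR = '\u2063'; // INVISIBLE SEPARATOR
    static final char PRIVATE_USE = '\uE000';         // first BMP private use code point

    // True if the JDK classifies c as whitespace, a letter, or a digit.
    static boolean looksLikeBoundaryOrWord(char c) {
        return Character.isWhitespace(c) || Character.isLetterOrDigit(c);
    }

    // Summarizes the general category (as the Character.getType constant) and the checks above.
    static String describe(char c) {
        return String.format("U+%04X type=%d whitespaceOrWord=%b",
                (int) c, Character.getType(c), looksLikeBoundaryOrWord(c));
    }

    public static void main(String[] args) {
        // A Java String is a sequence of UTF-16 code units, so an embedded NUL is legal.
        String withNul = "a" + NUL + "b";
        System.out.println("length with embedded NUL: " + withNul.length());

        for (char c : new char[] {NUL, NONSPACING_MARK, INVISIBLE_SEPARATOR, PRIVATE_USE}) {
            System.out.println(describe(c));
        }
    }
}
```

The type values printed correspond to Character.CONTROL, 
Character.NON_SPACING_MARK, Character.FORMAT and Character.PRIVATE_USE 
respectively, and none of the four is reported as whitespace, a letter, or a 
digit.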

I say we just punt: have two options that allow users to specify characters: 
one for when tags are stripped, one for when entities are converted to normal 
characters ... default both to an empty string (ie: current behavior)
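
A rough sketch of what the tag half of that idea could look like, assuming a 
hypothetical stripTags(input, fill) helper (this is not the HTMLStripReader 
API, and the regex is a naive stand-in for real HTML parsing). Each removed 
character is replaced by the fill string, so an empty fill reproduces plain 
stripping, while a one-character fill keeps offsets aligned with the original 
text. The title is the one from the issue description quoted below; the 
entity option would need a similar, separate fill and is not covered here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFillSketch {
    // Naive tag matcher, for illustration only.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static final String TITLE =
            "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics "
            + "in a polyorogenic terrane of NW Iberia";

    // Replaces every character of every tag with `fill`.
    // fill = ""  : tags vanish and later offsets shift left.
    // fill = "X" : output length equals input length, so offsets are preserved.
    static String stripTags(String input, String fill) {
        Matcher m = TAG.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());   // text before the tag
            for (int i = m.start(); i < m.end(); i++) {
                out.append(fill);                 // one fill per removed character
            }
            last = m.end();
        }
        out.append(input, last, input.length());  // trailing text
        return out.toString();
    }

    // How far "fabrics" moves left after stripping with the given fill.
    static int offsetShift(String fill) {
        return TITLE.indexOf("fabrics") - stripTags(TITLE, fill).indexOf("fabrics");
    }

    public static void main(String[] args) {
        System.out.println(offsetShift(""));        // 22: the summed tag lengths
        System.out.println(offsetShift("\uE000"));  // 0: offsets line up again
    }
}
```

With the empty default, the match lands 22 characters to the left, which is 
the misplacement described in the issue; with a single private use character 
as the fill, the offset of "fabrics" is unchanged.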

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, 
> SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup causes problems with highlighting 
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names 
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a 
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has 
> the <em> tags in the wrong place - 22 characters to the left of where they 
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
