[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Grant Ingersoll (JIRA) Tue, 08 Jan 2008 08:19:58 -0800

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556928#action_12556928 ]


Grant Ingersoll commented on SOLR-42:
-------------------------------------

Yet another problem:
In certain circumstances, restoreState() can fail because the underlying mark 
has been invalidated by reading well beyond its read-ahead limit.  This is most 
noticeable in the while (true) loop inside readProcessingInstruction() and 
can be reproduced with the following test:
{code}
public void testQuestionMark() throws Exception {
    StringBuilder testBuilder = new StringBuilder(5020);
    testBuilder.append("ah<?> ");
    // Tack on enough characters to go beyond the mark read-ahead limit,
    // since <?> makes HTMLStripReader think it is a processing instruction.
    for (int i = 0; i < 5000; i++) {
      testBuilder.append('a');
    }
    String test = testBuilder.toString();
    // Force the use of BufferedReader.
    Reader reader = new HTMLStripReader(new BufferedReader(new StringReader(test)));
    int ch = 0;
    StringBuilder builder = new StringBuilder();
    try {
      while ((ch = reader.read()) != -1) {
        builder.append((char) ch);
      }
    } finally {
      System.out.println("String: " + builder.toString());
    }
    assertTrue(builder.toString() + " is not equal to " + test,
        builder.toString().equals(test));
  }
{code}

In this case, the final assert is never reached, because reader.read() throws 
the following IOException:
{noformat}
java.io.IOException: Mark invalid
        at java.io.BufferedReader.reset(BufferedReader.java:485)
        at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:158)
        at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:731)
        at org.apache.solr.analysis.HTMLStripReaderTest.testQuestionMark(HTMLStripReaderTest.java:171)
{noformat}
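
The same "Mark invalid" failure can be shown with java.io.BufferedReader alone, 
without HTMLStripReader.  The following is only a minimal sketch: the class 
MarkLimitDemo, the method resetSucceeds() and the 16-character buffer are 
made up for illustration and are not part of Solr.  The Javadoc only says that 
reset() "may fail" once the read-ahead limit is exceeded, so the behaviour 
shown here is what the standard JDK implementation does when it has to refill 
its buffer, not a guarantee.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkLimitDemo {

    /**
     * Hypothetical helper: marks the start of a 200-character stream with the
     * given read-ahead limit, reads charsToRead characters, then reports
     * whether reset() still works.
     */
    static boolean resetSucceeds(int readAheadLimit, int charsToRead) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            text.append('a');
        }
        // A deliberately small buffer (16 chars) so that the reader has to
        // refill quickly; the refill is the point where the mark is dropped.
        BufferedReader reader = new BufferedReader(new StringReader(text.toString()), 16);
        reader.mark(readAheadLimit);
        for (int i = 0; i < charsToRead; i++) {
            reader.read();
        }
        try {
            reader.reset();
            return true;
        } catch (IOException e) {
            // Same kind of failure as in the stack trace above.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("within limit: " + resetSucceeds(4, 3));
        System.out.println("beyond limit: " + resetSucceeds(4, 100));
    }
}
```

Reading 3 characters stays inside the limit of 4, so reset() works; reading 
100 characters forces a buffer refill past the limit and reset() throws.  A 
caller that can read arbitrarily far past its mark would presumably need 
either a larger limit or a check before calling reset().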

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, 
> SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup causes problems with highlighting 
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names 
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a 
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has 
> the <em> tags in the wrong place - 22 characters to the left of where they 
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
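
The 22-character shift described in the issue can be reproduced with a small 
stand-alone sketch.  It uses a naive regular expression as a stand-in for 
HTMLStripReader (an assumption for illustration only; the real reader handles 
far more than simple tags), and the names OffsetShiftDemo, stripTags() and 
offsetShift() are hypothetical.

```java
public class OffsetShiftDemo {

    // Example title field from the issue description.
    static final String TITLE =
        "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics"
        + " in a polyorogenic terrane of NW Iberia";

    /** Naive tag removal; NOT how HTMLStripReader works, just enough for the demo. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    /**
     * Difference between a term's offset in the original text and its offset
     * in the stripped text that the tokenizer actually sees.
     */
    static int offsetShift(String html, String term) {
        return html.indexOf(term) - stripTags(html).indexOf(term);
    }

    public static void main(String[] args) {
        // Two <SUP> (5 chars) and two </SUP> (6 chars) precede the term: 22 in total.
        System.out.println("shift = " + offsetShift(TITLE, "fabrics")); // prints "shift = 22"
    }
}
```

Token offsets computed on the stripped text are 22 too small when applied to 
the stored original, which is why the <em> tags land 22 characters to the left.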

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
