https://issues.apache.org/jira/browse/SOLR-42 changed the
HtmlStripReader so that Tokens from a TokenStream made with
HTMLStripWhitespaceTokenizerFactory would have the correct
Token.startOffset() values. If I'm not mistaken, though, the
HtmlStripReader in trunk still doesn't get offsets quite right where
XML processing instructions like

  <?xml version="1.0" encoding="UTF-8" ?>

are concerned. SOLR-42 is marked as resolved, so I'll write what I
know right here. I'm hoping that someone more familiar with
HtmlStripReader than I am could fix this quickly — it looks like a
small change.

To demonstrate the problem, I made a little test class that will
tokenize some text with the HTMLStripWhitespaceTokenizer, and then
display both the startOffset of each token and the first few
characters on and after the startOffset. As you can see, things work
fine for most test strings, but in the case with processing
instructions, the startOffset is off by one character. Here's the
output:

-------------------------------------
String to test: <uniqueKey>id</uniqueKey>
  Token info:
    token 'id'
      startOffset: 11
      char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- Unless this field is marked with
required="false", it will be a required field -->
<uniqueKey>id</uniqueKey>
  Token info:
    token 'id'
      startOffset: 99
      char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- And now: two elements --> <element1>one</element1>
  <element2>two</element2>
  Token info:
    token 'one'
      startOffset: 41
      char at startOffset, and next few: 'one</'
    token 'two'
      startOffset: 68
      char at startOffset, and next few: 'two</'
-------------------------------------
String to test: <?xml version="1.0" encoding="UTF-8" ?><uniqueKey>id</uniqueKey>
  Token info:
    token 'id'
      startOffset: 49
      char at startOffset, and next few: '>id</'
-------------------------------------

I've also modified one of the existing test cases to identify the
problem. I will paste the rest of my code below.

Thanks,
Chris

*******************************

[Source code for the test program whose output appears above]

import java.io.Reader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.*;
import org.apache.solr.analysis.*;


public class Baz
{
        public static void main(String args[]) throws IOException
        {
                String singleElement = "<uniqueKey>id</uniqueKey>";
                String singleElementWithComment = "<!-- Unless this field is 
marked
with required=\"false\", it will be a required field -->
<uniqueKey>id</uniqueKey>";
                String twoElementsWithComment = "<!-- And now: two elements -->
<element1>one</element1>\n  <element2>two</element2>";
                String elementWithXmlHeader = "<?xml version=\"1.0\"
encoding=\"UTF-8\" ?><uniqueKey>id</uniqueKey>";


                testStr(singleElement);
                testStr(singleElementWithComment);
                testStr(twoElementsWithComment);
                testStr(elementWithXmlHeader);
        }

        static void testStr(String s) throws IOException
        {
                System.out.println("-------------------------------------");
                System.out.println("String to test: " + s);
                System.out.println("  Token info:");
                StringReader reader = new StringReader(s);

                HTMLStripWhitespaceTokenizerFactory factory = new
HTMLStripWhitespaceTokenizerFactory();

                //This standard factory also gets processing instructions wrong:
                //HTMLStripStandardTokenizerFactory factory = new
HTMLStripStandardTokenizerFactory();
                
                TokenStream ts = factory.create(reader);

                while (true)
                {
                        Token t = ts.next();
                        if (t == null)
                        {
                                break;
                        }
                
                        String tokenText = new String(t.termBuffer(), 0, 
t.termLength());       
                        String startOffsetStr = s.substring(t.startOffset(), 
t.startOffset()+5);
                        System.out.println("    token '" + tokenText + "'");
                        System.out.println("      startOffset: " + 
t.startOffset());
                        System.out.println("      char at startOffset, and next 
few: '" +
startOffsetStr + "'");
                }       
        }
}

***************************

[Here's the unit test]

    public void testXmlProcessingInstruction() throws IOException {
    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><p>Here
is a paragraph.</p>";
    String gold = "                                          Here is a
paragraph.    ";
    HTMLStripReader reader = new HTMLStripReader(new StringReader(html));
    StringBuilder builder = new StringBuilder();
    int ch = -1;
    char [] goldArray = gold.toCharArray();
    int position = 0;
    while ((ch = reader.read()) != -1){
      char theChar = (char) ch;
      builder.append(theChar);
      assertTrue("\"" + theChar + "\"" + " at position: " + position +
" does not equal: " + goldArray[position]
              + " Buffer so far: " + builder + "<EOB>", theChar ==
goldArray[position]);
      position++;
    }
    assertTrue(gold + " is not equal to " + builder.toString(),
gold.equals(builder.toString()) == true);
  }

Reply via email to