Re: HTML decoder is splitting tokens

Anders Melchiorsen Fri, 28 Aug 2009 14:56:39 -0700

Greetings.

I am moving this issue from the solr-user list. As can be seen in the
messages below, I am having problems with the Solr HTML stripper.


After some investigation, I have found the cause to be that the
stripper is replacing the removed HTML with spaces. This obviously
breaks when the HTML is in the middle of a word, like "G&uuml;nther".
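
To illustrate the failure outside Solr, here is a minimal sketch in plain Java. The class and method names are hypothetical and are not part of Solr; the code only mimics padding a decoded entity with spaces up to its original length and then splitting on whitespace.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not Solr code.
class SpacePaddingDemo {
    // Mimics the padding behaviour: emit the decoded character, then fill
    // the rest of the entity's original length with spaces.
    static String decodePadded(String s, String entity, char decoded) {
        StringBuilder pad = new StringBuilder().append(decoded);
        for (int i = 1; i < entity.length(); i++) {
            pad.append(' ');
        }
        return s.replace(entity, pad.toString());
    }

    // Decodes the entity without any padding.
    static String decodePlain(String s, String entity, char decoded) {
        return s.replace(entity, String.valueOf(decoded));
    }

    // Stand-in for a whitespace tokenizer.
    static List<String> whitespaceTokens(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String in = "G&uuml;nther";
        // Padded: the word is split into two tokens.
        System.out.println(whitespaceTokens(decodePadded(in, "&uuml;", '\u00fc')));
        // Unpadded: the word stays one token, but the offsets now need correcting.
        System.out.println(whitespaceTokens(decodePlain(in, "&uuml;", '\u00fc')));
    }
}
```

Padding keeps the output the same length as the input, which is why offsets work without any correction, but it costs the word boundary.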

So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.
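
The idea behind offset correction, as I understand it, can be sketched as follows. This is a hypothetical stand-in built on a TreeMap, not the actual BaseCharFilter code: each entry records, for an offset in the filtered output, how many input characters have been dropped so far, and a lookup adds that cumulative difference back.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an offset correction table, not Solr/Lucene code.
class OffsetCorrectionSketch {
    // output offset -> cumulative number of input chars removed before it
    private final TreeMap<Integer, Integer> diffs = new TreeMap<>();

    void addCorrection(int outputOffset, int cumulativeDiff) {
        diffs.put(outputOffset, cumulativeDiff);
    }

    // Maps an offset in the filtered output back to the original input.
    int correct(int outputOffset) {
        Map.Entry<Integer, Integer> e = diffs.floorEntry(outputOffset);
        return e == null ? outputOffset : outputOffset + e.getValue();
    }

    public static void main(String[] args) {
        // "G&uuml;nther" is filtered to "G\u00fcnther": the six-character
        // entity becomes one character, so five characters are dropped
        // before output offset 2.
        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        map.addCorrection(2, 5);
        System.out.println(map.correct(1)); // 1: the entity starts at input offset 1
        System.out.println(map.correct(2)); // 7: 'n' sits at input offset 7
        System.out.println(map.correct(7)); // 12: end of the input
    }
}
```

With such a table the tokenizer sees the word unbroken, while token offsets can still be translated back to positions in the original HTML for highlighting.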

That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those, readName() and
readAttr2().

Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.
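
The rollback problem can be reproduced in isolation. The class below is a hypothetical sketch, not the filter itself: it keeps a read counter next to a mark/reset stream and shows that the counter has to be rewound together with the stream, or every later offset is too large. The names numRead and lastMark follow the patch; everything else is assumed for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Solr code.
class CountingRollbackSketch {
    private final Reader input;
    private int numRead = 0;   // characters consumed from the input so far
    private int lastMark = 0;  // value of numRead when the stream was marked

    CountingRollbackSketch(String text) {
        this.input = new StringReader(text);
    }

    int next() {
        try {
            int ch = input.read();
            if (ch != -1) numRead++;
            return ch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void saveState(int readAheadLimit) {
        try {
            input.mark(readAheadLimit);
            lastMark = numRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void restoreState() {
        try {
            input.reset();
            numRead = lastMark; // without this line the counter drifts ahead
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int numRead() {
        return numRead;
    }

    public static void main(String[] args) {
        CountingRollbackSketch s = new CountingRollbackSketch("hi &<< there");
        for (int i = 0; i < 3; i++) s.next(); // consume "hi "
        s.saveState(16);
        s.next();                             // speculatively read '&'
        s.next();                             // and '<', then give up
        s.restoreState();
        System.out.println(s.numRead());      // back to the marked position
    }
}
```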

At this point I stopped trying to break it, so there may still be more
problems. Or I might have introduced some problem of my own. Anyway, I
have put the three patches, plus a test update, at the bottom of this
mail, in case somebody wants to move along with this issue.



Regards,
Anders.



"Anders Melchiorsen" <m...@spoon.kalibalik.dk> writes:

> Hello.
>
> Thanks for the hints. Still some trouble, though.
>
> I added just the HTMLStripCharFilterFactory because, according to
> documentation, it should also replace HTML entities. It did, but
> still left a space after the entity, so I got two tokens from
> "G&uuml;nther". That seems like a bug?
>
> Adding MappingCharFilterFactory in front of the HTML stripper (so
> that the latter will not see the entity) does work as expected. That
> is, until I try strings like "use &lt;p&gt; to mark a paragraph",
> where the HTML stripper will then remove parts of the actual text.
> So this approach will not work.
>
>
> Finally, I was happy that I could now use an arbitrary tokenizer
> with HTML input. The PatternTokenizer, however, seems to be using
> character offsets corresponding to the output of the char filters,
> and so the highlighting markers end up at the wrong place. Is that a
> bug, or a configuration issue?
>
>
> Cheers,
> Anders.
>
>
> Koji Sekiguchi wrote:
>> Hi Anders,
>>
>> Sorry, I don't know whether this is a bug or a feature, but
>> I'd like to show an alternative approach, if you'd like.
>>
>> In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
>> marked as deprecated. Using HTMLStripCharFilterFactory with
>> an arbitrary TokenizerFactory is encouraged instead.
>> I'd also recommend using MappingCharFilterFactory
>> to convert character references to real characters.
>> That is, you have:
>>
>> <fieldType name="textHtml" class="solr.TextField" >
>>   <analyzer>
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> where the contents of mapping.txt are:
>>
>> "&uuml;" => "ü"
>> "&auml;" => "ä"
>> "&iuml;" => "ï"
>> "&euml;" => "ë"
>> "&ouml;" => "ö"
>>     :             :
>>
>> Then run analysis.jsp and see the result.
>>
>> Thank you,
>>
>> Koji
>>
>>
>> Anders Melchiorsen wrote:
>>> Hi.
>>>
>>> When indexing the string "G&uuml;nther" with
>>> HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two
>>> tokens, "Gü" and "nther".
>>>
>>> Is this a bug, or am I doing something wrong?
>>>
>>> (Using a Solr nightly from 2009-05-29)
>>>
>>>
>>> Anders.
>>>


commit 1fb2d42181d8effb1b444aa2fa02d86df1d860d7
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:03 2009 +0200

    Use offset correction instead of inserting spaces into the stream
    
    Fixes "G&uuml;nther" turning into "Gü     nther".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index 733d783..e473cef 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -37,7 +37,9 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private int readAheadLimit = DEFAULT_READ_AHEAD;
   private int safeReadAheadLimit = readAheadLimit - 3;
   private int numWhitespace = 0;
+  private int numWhitespaceCorrected = 0;
   private int numRead = 0;
+  private int numReadLast = 0;
   private int lastMark;
   private Set<String> escapedTags;
 
@@ -674,9 +676,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
     // where do we have to worry about them?
     // <![ CDATA [ unescaped markup ]]>
     if (numWhitespace > 0){
-      numWhitespace--;
-      return ' ';
+      addOffCorrectMap(numReadLast+1-numWhitespaceCorrected, numWhitespaceCorrected+numWhitespace);
+      numWhitespaceCorrected += numWhitespace;
+      numWhitespace = 0;
     }
+    numReadLast = numRead;
     //do not limit this one by the READAHEAD
     while(true) {
       int lastNumRead = numRead;

commit 542f5734136bbfd72ae802c30b6c61361268bccf
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:29 2009 +0200

    Use next() in place of read()
    
    The read() method is our public interface, while next()
    is what we use internally to get the next character.

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index e473cef..ab14de5 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -537,13 +537,13 @@ public class HTMLStripCharFilter extends BaseCharFilter {
 
   private int readName(boolean checkEscaped) throws IOException {
     StringBuilder builder = (checkEscaped && escapedTags!=null) ? new StringBuilder() : null;
-    int ch = read();
+    int ch = next();
     if (builder!=null) builder.append((char)ch);
     if (!isFirstIdChar(ch)) return MISMATCH;
-    ch = read();
+    ch = next();
     if (builder!=null) builder.append((char)ch);
     while(isIdChar(ch)) {
-      ch=read();
+      ch = next();
       if (builder!=null) builder.append((char)ch);
     }
     if (ch!=-1) {
@@ -572,11 +572,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
     //  <a href="a/<!--#echo "path"-->">
     private int readAttr2() throws IOException {
     if ((numRead - lastMark < safeReadAheadLimit)) {
-      int ch = read();
+      int ch = next();
       if (!isFirstIdChar(ch)) return MISMATCH;
-      ch = read();
+      ch = next();
       while(isIdChar(ch) && ((numRead - lastMark) < safeReadAheadLimit)){
-        ch=read();
+        ch = next();
       }
       if (isSpace(ch)) ch = nextSkipWS();
 

commit fdaa0920e2dceeb33e534138fe4a672914aff0ea
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 16:31:34 2009 +0200

    Restore the numRead variable when rolling back the stream
    
    This fixes offset corrections after invalid HTML input, like
    "hi &<< <b>there</b>".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index ab14de5..4bfa85b 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -171,6 +171,7 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private void restoreState() throws IOException {
     input.reset();
     pushed.setLength(0);
+    numRead = lastMark;
   }
 
   private int readNumericEntity() throws IOException {

commit 571537795af2edb54543db2f71550662b0a18e60
Author: Anders Melchiorsen <a...@gnu.jobsafari.dk>
Date:   Fri Aug 28 23:49:03 2009 +0200

    Update some tests.

diff --git a/HTMLStripCharFilterTest.java b/HTMLStripCharFilterTest.java
index 7be7c7e..4730830 100644
--- a/HTMLStripCharFilterTest.java
+++ b/HTMLStripCharFilterTest.java
@@ -49,9 +49,9 @@ public class HTMLStripCharFilterTest extends TestCase {
     String html = "<div class=\"foo\">this is some text</div> here is a <a href=\"#bar\">link</a> and " +
             "another <a href=\"http://lucene.apache.org/\">link</a>. " +
             "This is an entity: &amp; plus a &lt;.  Here is an &. <!-- is a comment -->";
-    String gold = "                 this is some text       here is a                link     and " +
-            "another                                     link    . " +
-            "This is an entity: &     plus a <   .  Here is an &.                      ";
+    String gold = " this is some text  here is a  link  and " +
+            "another  link . " +
+            "This is an entity: & plus a <.  Here is an &. ";
     HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new StringReader(html)));
     StringBuilder builder = new StringBuilder();
     int ch = -1;
@@ -87,7 +87,7 @@ public class HTMLStripCharFilterTest extends TestCase {
 
   public void testGamma() throws Exception {
     String test = "&Gamma;";
-    String gold = "\u0393      ";
+    String gold = "\u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -104,7 +104,7 @@ public class HTMLStripCharFilterTest extends TestCase {
 
   public void testEntities() throws Exception {
     String test = "&nbsp; &lt;foo&gt; &#61; &Gamma; bar &#x393;";
-    String gold = "       <   foo>    =     \u0393       bar \u0393     ";
+    String gold = "  <foo> = \u0393 bar \u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -121,7 +121,7 @@ public class HTMLStripCharFilterTest extends TestCase {
 
   public void testMoreEntities() throws Exception {
     String test = "&nbsp; &lt;junk/&gt; &nbsp; &#33; &#64; and &#8217;";
-    String gold = "       <   junk/>           !     @     and ’      ";
+    String gold = "  <junk/>   ! @ and ’";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -264,7 +264,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testComment() throws Exception {
 
     String test = "<!--- three dashes, still a valid comment ---> ";
-    String gold = "                                               ";
+    String gold = "  ";
     Reader reader = new HTMLStripCharFilter(CharReader.get(new BufferedReader(new StringReader(test))));//force the use of BufferedReader
     int ch = 0;
     StringBuilder builder = new StringBuilder();
