Re: HTML decoder is splitting tokens

Koji Sekiguchi Fri, 28 Aug 2009 17:53:15 -0700

Anders,

Thank you for attaching the patch. Sorry again, I don't have
enough time right now to investigate the patch or the problem you
are seeing. I'd recommend that you open a JIRA issue and attach the
patch there, so that I or someone else can look into it later.


And I didn't understand this part of your previous mail:

> Adding MappingCharFilterFactory in front of the HTML stripper (so
> that the latter will not see the entity) does work as expected. That
> is, until I try strings like "use &lt;p&gt; to mark a paragraph",
> where the HTML stripper will then remove parts of the actual text.
> So this approach will not work.

Thanks,

Koji

Anders Melchiorsen wrote:

Greetings.

I am moving this issue from the solr-user list. As can be seen in the
messages below, I am having problems with the Solr HTML stripper.

After some investigation, I have found the cause to be that the
stripper is replacing the removed HTML with spaces. This obviously
breaks when the HTML is in the middle of a word, like "G&uuml;nther".

So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.
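
The difference between the two approaches can be sketched in a few
lines of Java. This is only an illustration under simplifying
assumptions, not the code from the patch: the class
OffsetCorrectionSketch, its two-entry entity table, and all method
names are made up for this example. padWithSpaces mimics the old
behaviour (the output stays as long as the input), while
decodeWithCorrection emits only the decoded character and records by
how much later offsets must be shifted to point back into the input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, not the HTMLStripCharFilter code.
class OffsetCorrectionSketch {
    // Tiny made-up entity table: {entity, replacement character}.
    static final String[][] ENTITIES = { {"&uuml;", "\u00fc"}, {"&Gamma;", "\u0393"} };

    // Returns the index of the entity starting at pos, or -1 if there is none.
    static int matchEntity(String in, int pos) {
        for (int k = 0; k < ENTITIES.length; k++) {
            if (in.startsWith(ENTITIES[k][0], pos)) return k;
        }
        return -1;
    }

    // Old behaviour: keep output length equal to input length by padding with spaces.
    static String padWithSpaces(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            int k = matchEntity(in, i);
            if (k < 0) { out.append(in.charAt(i++)); continue; }
            out.append(ENTITIES[k][1]);
            for (int p = 1; p < ENTITIES[k][0].length(); p++) out.append(' ');
            i += ENTITIES[k][0].length();
        }
        return out.toString();
    }

    // Decoded text plus corrections: from output offset offsets.get(i) onward,
    // add diffs.get(i) to obtain the offset in the original input.
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<Integer> offsets = new ArrayList<>();
        final List<Integer> diffs = new ArrayList<>();

        int correctOffset(int outOffset) {
            int diff = 0;
            for (int i = 0; i < offsets.size(); i++) {
                if (offsets.get(i) <= outOffset) diff = diffs.get(i);
            }
            return outOffset + diff;
        }
    }

    // New behaviour: no padding, but remember the cumulative length difference.
    static Result decodeWithCorrection(String in) {
        Result r = new Result();
        int cumulative = 0;
        int i = 0;
        while (i < in.length()) {
            int k = matchEntity(in, i);
            if (k < 0) { r.text.append(in.charAt(i++)); continue; }
            r.text.append(ENTITIES[k][1]);
            cumulative += ENTITIES[k][0].length() - 1; // input chars that vanished
            r.offsets.add(r.text.length());
            r.diffs.add(cumulative);
            i += ENTITIES[k][0].length();
        }
        return r;
    }
}
```

With "G&uuml;nther" as input, the padded variant contains a run of
spaces after the "ü", so a whitespace tokenizer sees two tokens. The
corrected variant yields one word, and correctOffset maps the
position of "n" in the output back to its position in the input.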

That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.

At this point I stopped trying to break it, so there may still be more
problems. Or I might have introduced some problem on my own. Anyway, I
have put the three patches at the bottom of this mail, in case
somebody wants to move along with this issue.



Regards,
Anders.



"Anders Melchiorsen" <m...@spoon.kalibalik.dk> writes:

Hello.

Thanks for the hints. Still some trouble, though.

I added just the HTMLStripCharFilterFactory because, according to
documentation, it should also replace HTML entities. It did, but
still left a space after the entity, so I got two tokens from
"G&uuml;nther". That seems like a bug?

Adding MappingCharFilterFactory in front of the HTML stripper (so
that the latter will not see the entity) does work as expected. That
is, until I try strings like "use &lt;p&gt; to mark a paragraph",
where the HTML stripper will then remove parts of the actual text.
So this approach will not work.
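
The ordering problem can be reproduced with two naive stand-ins for
the filters. This is a sketch only: mapEntities and stripTags are
hypothetical helpers built on plain string replacement and a crude
regular expression, not the Solr classes, and real HTML stripping is
far more involved.

```java
// Illustration only: naive stand-ins, not the Solr char filters.
class FilterOrderSketch {
    // Stand-in for a mapping filter that turns entities into characters.
    static String mapEntities(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">").replace("&uuml;", "\u00fc");
    }

    // Stand-in for a tag stripper: drops anything that looks like <...>.
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    // Mapping in front of the stripper, as in the configuration discussed here.
    static String mapThenStrip(String s) {
        return stripTags(mapEntities(s));
    }

    // The opposite order, for comparison.
    static String stripThenMap(String s) {
        return mapEntities(stripTags(s));
    }
}
```

For "use &lt;p&gt; to mark a paragraph", mapThenStrip first produces
a literal "<p>", which the stripper then takes for a tag and deletes.
stripThenMap keeps the "<p>" because the stripper only ever sees the
escaped form.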


Finally, I was happy that I could now use an arbitrary tokenizer
with HTML input. The PatternTokenizer, however, seems to be using
character offsets corresponding to the output of the char filters,
and so the highlighting markers end up at the wrong place. Is that a
bug, or a configuration issue?
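
A small Java sketch of why uncorrected offsets misplace the
highlighting markers. The input string, the bracket markers, and the
class HighlightOffsetSketch are assumptions made up for this example.
The corrected offsets are computed by hand here, whereas a real
analysis chain would have to obtain them from the char filters.

```java
// Illustration only: hypothetical helper, not Solr's highlighter.
class HighlightOffsetSketch {
    // Wraps original[start, end) in bracket markers.
    static String highlight(String original, int start, int end) {
        return original.substring(0, start)
                + "[" + original.substring(start, end) + "]"
                + original.substring(end);
    }

    static final String ORIGINAL = "<b>hello</b> world";
    static final String FILTERED = "hello world"; // tags removed, no padding

    // Offsets of "world" measured in the filtered text.
    static String uncorrected() {
        int start = FILTERED.indexOf("world");
        return highlight(ORIGINAL, start, start + "world".length());
    }

    // Offsets shifted by the 7 characters of markup removed before "world".
    static String corrected() {
        int removed = "<b>".length() + "</b>".length();
        int start = FILTERED.indexOf("world") + removed;
        return highlight(ORIGINAL, start, start + "world".length());
    }
}
```

Here uncorrected() puts the markers in the middle of "hello" and its
closing tag, while corrected() wraps the word "world" in the original
string.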


Cheers,
Anders.


Koji Sekiguchi wrote:

Hi Anders,

Sorry, I don't know whether this is a bug or a feature, but
I'd like to show an alternative approach, if you'd like.

In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
marked as deprecated. Using HTMLStripCharFilterFactory together
with an arbitrary TokenizerFactory is encouraged instead.
I'd also recommend using MappingCharFilterFactory
to convert character references to real characters.
That is, you have:

<fieldType name="textHtml" class="solr.TextField" >
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

where the contents of mapping.txt are:

"&uuml;" => "ü"
"&auml;" => "ä"
"&iuml;" => "ï"
"&euml;" => "ë"
"&ouml;" => "ö"
    :             :

Then run analysis.jsp and see the result.

Thank you,

Koji


Anders Melchiorsen wrote:

Hi.

When indexing the string "G&uuml;nther" with
HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two
tokens, "Gü" and "nther".

Is this a bug, or am I doing something wrong?

(Using a Solr nightly from 2009-05-29)


Anders.



commit 1fb2d42181d8effb1b444aa2fa02d86df1d860d7
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:03 2009 +0200

    Use offset correction instead of inserting spaces into the stream

    Fixes "Günther" turning into "Gü nther".


diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index 733d783..e473cef 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -37,7 +37,9 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private int readAheadLimit = DEFAULT_READ_AHEAD;
   private int safeReadAheadLimit = readAheadLimit - 3;
   private int numWhitespace = 0;
+  private int numWhitespaceCorrected = 0;
   private int numRead = 0;
+  private int numReadLast = 0;
   private int lastMark;
   private Set<String> escapedTags;

@@ -674,9 +676,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {

     // where do we have to worry about them?
     // <![ CDATA [ unescaped markup ]]>
     if (numWhitespace > 0){
-      numWhitespace--;
-      return ' ';
+      addOffCorrectMap(numReadLast+1-numWhitespaceCorrected, numWhitespaceCorrected+numWhitespace);
+      numWhitespaceCorrected += numWhitespace;
+      numWhitespace = 0;
     }
+    numReadLast = numRead;
     //do not limit this one by the READAHEAD
     while(true) {
       int lastNumRead = numRead;

commit 542f5734136bbfd72ae802c30b6c61361268bccf
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:29 2009 +0200

    Use next() in place of read()

    The read() method is our public interface, while next()
    is what we use internally to get the next character.

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index e473cef..ab14de5 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -537,13 +537,13 @@ public class HTMLStripCharFilter extends BaseCharFilter {

   private int readName(boolean checkEscaped) throws IOException {
     StringBuilder builder = (checkEscaped && escapedTags!=null) ? new StringBuilder() : null;
-    int ch = read();
+    int ch = next();
     if (builder!=null) builder.append((char)ch);
     if (!isFirstIdChar(ch)) return MISMATCH;
-    ch = read();
+    ch = next();
     if (builder!=null) builder.append((char)ch);
     while(isIdChar(ch)) {
-      ch=read();
+      ch = next();
       if (builder!=null) builder.append((char)ch);
     }
     if (ch!=-1) {
@@ -572,11 +572,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
     //  <a href="a/<!--#echo "path"-->">
     private int readAttr2() throws IOException {
     if ((numRead - lastMark < safeReadAheadLimit)) {
-      int ch = read();
+      int ch = next();
       if (!isFirstIdChar(ch)) return MISMATCH;
-      ch = read();
+      ch = next();
       while(isIdChar(ch) && ((numRead - lastMark) < safeReadAheadLimit)){
-        ch=read();
+        ch = next();
       }
       if (isSpace(ch)) ch = nextSkipWS();

commit fdaa0920e2dceeb33e534138fe4a672914aff0ea
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 16:31:34 2009 +0200

    Restore the numRead variable when rolling back the stream

    This fixes offset corrections after invalid HTML input, like
    "hi &<< <b>there</b>".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index ab14de5..4bfa85b 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -171,6 +171,7 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private void restoreState() throws IOException {
     input.reset();
     pushed.setLength(0);
+    numRead = lastMark;
   }

   private int readNumericEntity() throws IOException {


commit 571537795af2edb54543db2f71550662b0a18e60
Author: Anders Melchiorsen <a...@gnu.jobsafari.dk>
Date:   Fri Aug 28 23:49:03 2009 +0200

    Update some tests.

diff --git a/HTMLStripCharFilterTest.java b/HTMLStripCharFilterTest.java
index 7be7c7e..4730830 100644
--- a/HTMLStripCharFilterTest.java
+++ b/HTMLStripCharFilterTest.java
@@ -49,9 +49,9 @@ public class HTMLStripCharFilterTest extends TestCase {
     String html = "<div class=\"foo\">this is some text</div> here is a <a href=\"#bar\">link</a> and " +
             "another <a href=\"http://lucene.apache.org/\">link</a>. " +
             "This is an entity: &amp; plus a &lt;.  Here is an &. <!-- is a comment -->";
-    String gold = "                 this is some text       here is a                link     and " +
-            "another                                     link    . " +
-            "This is an entity: &     plus a <   .  Here is an &.                      ";
+    String gold = " this is some text  here is a  link  and " +
+            "another  link . " +
+            "This is an entity: & plus a <.  Here is an &. ";
     HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new StringReader(html)));
     StringBuilder builder = new StringBuilder();
     int ch = -1;
@@ -87,7 +87,7 @@ public class HTMLStripCharFilterTest extends TestCase {

   public void testGamma() throws Exception {
     String test = "&Gamma;";
-    String gold = "\u0393      ";
+    String gold = "\u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -104,7 +104,7 @@ public class HTMLStripCharFilterTest extends TestCase {

   public void testEntities() throws Exception {
     String test = "&nbsp; &lt;foo&gt; &#61; &Gamma; bar &#x393;";
-    String gold = "       <   foo>    =     \u0393       bar \u0393     ";
+    String gold = "  <foo> = \u0393 bar \u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -121,7 +121,7 @@ public class HTMLStripCharFilterTest extends TestCase {

   public void testMoreEntities() throws Exception {
     String test = "&nbsp; &lt;junk/&gt; &nbsp; &#33; &#64; and &#8217;";
-    String gold = "       <   junk/>           !     @     and ’      ";
+    String gold = "  <junk/>   ! @ and ’";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -264,7 +264,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testComment() throws Exception {
     String test = " ";
-    String gold = "                                               ";
+    String gold = "  ";
     Reader reader = new HTMLStripCharFilter(CharReader.get(new BufferedReader(new StringReader(test))));//force the use of BufferedReader
     int ch = 0;
     StringBuilder builder = new StringBuilder();
