Greetings. I am moving this issue from the solr-user list. As can be seen in the messages below, I am having problems with the Solr HTML stripper.
After some investigation, I have found the cause to be that the stripper is replacing the removed HTML with spaces. This obviously breaks when the HTML is in the middle of a word, like "G&uuml;nther".

So, without knowing what I was doing, I hacked together a fix that uses offset correction instead.

That seemed to work, except that closing tags and attributes still broke the positioning. With even less of a clue, I replaced read() with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by restoring numRead when rolling back the input stream.

At this point I stopped trying to break it, so there may still be more problems. Or I might have introduced some problem on my own.

Anyway, I have put the three patches at the bottom of this mail, in case somebody wants to move along with this issue.

Regards,
Anders.


"Anders Melchiorsen" <m...@spoon.kalibalik.dk> writes:

> Hello.
>
> Thanks for the hints. Still some trouble, though.
>
> I added just the HTMLStripCharFilterFactory because, according to
> documentation, it should also replace HTML entities. It did, but
> still left a space after the entity, so I got two tokens from
> "G&uuml;nther". That seems like a bug?
>
> Adding MappingCharFilterFactory in front of the HTML stripper (so
> that the latter will not see the entity) does work as expected. That
> is, until I try strings like "use &lt;p&gt; to mark a paragraph",
> where the HTML stripper will then remove parts of the actual text.
> So this approach will not work.
>
>
> Finally, I was happy that I could now use an arbitrary tokenizer
> with HTML input. The PatternTokenizer, however, seems to be using
> character offsets corresponding to the output of the char filters,
> and so the highlighting markers end up at the wrong place. Is that a
> bug, or a configuration issue?
>
>
> Cheers,
> Anders.
>
>
> Koji Sekiguchi wrote:
>> Hi Anders,
>>
>> Sorry, I don't know this is a bug or a feature, but
>> I'd like to show an alternate way if you'd like.
>>
>> In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
>> marked as deprecated. Instead, HTMLStripCharFilterFactory and
>> an arbitrary TokenizerFactory are encouraged to use.
>> And I'd recommend you to use MappingCharFilterFactory
>> to convert character references to real characters.
>> That is, you have:
>>
>> <fieldType name="textHtml" class="solr.TextField" >
>>   <analyzer>
>>     <charFilter class="solr.MappingCharFilterFactory"
>>                 mapping="mapping.txt"/>
>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> where the contents of mapping.txt:
>>
>> "&uuml;" => "ü"
>> "&auml;" => "ä"
>> "&iuml;" => "ï"
>> "&euml;" => "ë"
>> "&ouml;" => "ö"
>> : :
>>
>> Then run analysis.jsp and see the result.
>>
>> Thank you,
>>
>> Koji
>>
>>
>> Anders Melchiorsen wrote:
>>> Hi.
>>>
>>> When indexing the string "G&uuml;nther" with
>>> HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two
>>> tokens, "Gü" and "nther".
>>>
>>> Is this a bug, or am I doing something wrong?
>>>
>>> (Using a Solr nightly from 2009-05-29)
>>>
>>>
>>> Anders.
>>>


commit 1fb2d42181d8effb1b444aa2fa02d86df1d860d7
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:03 2009 +0200

    Use offset correction instead of inserting spaces into the stream

    Fixes "G&uuml;nther" turning into "Gü nther".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index 733d783..e473cef 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -37,7 +37,9 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private int readAheadLimit = DEFAULT_READ_AHEAD;
   private int safeReadAheadLimit = readAheadLimit - 3;
   private int numWhitespace = 0;
+  private int numWhitespaceCorrected = 0;
   private int numRead = 0;
+  private int numReadLast = 0;
   private int lastMark;
   private Set<String> escapedTags;
@@ -674,9 +676,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
     // where do we have to worry about them?
     // <![ CDATA [ unescaped markup ]]>
     if (numWhitespace > 0){
-      numWhitespace--;
-      return ' ';
+      addOffCorrectMap(numReadLast+1-numWhitespaceCorrected, numWhitespaceCorrected+numWhitespace);
+      numWhitespaceCorrected += numWhitespace;
+      numWhitespace = 0;
     }
+    numReadLast = numRead;
     //do not limit this one by the READAHEAD
     while(true) {
       int lastNumRead = numRead;

commit 542f5734136bbfd72ae802c30b6c61361268bccf
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:29 2009 +0200

    Use next() in place of read()

    The read() method is our public interface, while next() is what we
    use internally to get the next character.

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index e473cef..ab14de5 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -537,13 +537,13 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private int readName(boolean checkEscaped) throws IOException {
     StringBuilder builder = (checkEscaped && escapedTags!=null) ? new StringBuilder() : null;
-    int ch = read();
+    int ch = next();
     if (builder!=null) builder.append((char)ch);
     if (!isFirstIdChar(ch)) return MISMATCH;
-    ch = read();
+    ch = next();
     if (builder!=null) builder.append((char)ch);
     while(isIdChar(ch)) {
-      ch=read();
+      ch = next();
       if (builder!=null) builder.append((char)ch);
     }
     if (ch!=-1) {
@@ -572,11 +572,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   // <a href="a/<!--#echo "path"-->">
   private int readAttr2() throws IOException {
     if ((numRead - lastMark < safeReadAheadLimit)) {
-      int ch = read();
+      int ch = next();
       if (!isFirstIdChar(ch)) return MISMATCH;
-      ch = read();
+      ch = next();
       while(isIdChar(ch) && ((numRead - lastMark) < safeReadAheadLimit)){
-        ch=read();
+        ch = next();
       }
       if (isSpace(ch)) ch = nextSkipWS();

commit fdaa0920e2dceeb33e534138fe4a672914aff0ea
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date:   Fri Aug 28 16:31:34 2009 +0200

    Restore the numRead variable when rolling back the stream

    This fixes offset corrections after invalid HTML input, like
    "hi &<< <b>there</b>".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index ab14de5..4bfa85b 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -171,6 +171,7 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private void restoreState() throws IOException {
     input.reset();
     pushed.setLength(0);
+    numRead = lastMark;
   }

   private int readNumericEntity() throws IOException {

commit 571537795af2edb54543db2f71550662b0a18e60
Author: Anders Melchiorsen <a...@gnu.jobsafari.dk>
Date:   Fri Aug 28 23:49:03 2009 +0200

    Update some tests.

diff --git a/HTMLStripCharFilterTest.java b/HTMLStripCharFilterTest.java
index 7be7c7e..4730830 100644
--- a/HTMLStripCharFilterTest.java
+++ b/HTMLStripCharFilterTest.java
@@ -49,9 +49,9 @@ public class HTMLStripCharFilterTest extends TestCase {
     String html = "<div class=\"foo\">this is some text</div> here is a <a href=\"#bar\">link</a> and " +
             "another <a href=\"http://lucene.apache.org/\">link</a>. " +
             "This is an entity: & plus a <. Here is an &. <!-- is a comment -->";
-    String gold = " this is some text here is a link and " +
-            "another link . " +
-            "This is an entity: & plus a < . Here is an &. ";
+    String gold = " this is some text here is a link and " +
+            "another link . " +
+            "This is an entity: & plus a <. Here is an &. ";
     HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new StringReader(html)));
     StringBuilder builder = new StringBuilder();
     int ch = -1;
@@ -87,7 +87,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testGamma() throws Exception {
     String test = "Γ";
-    String gold = "\u0393 ";
+    String gold = "\u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -104,7 +104,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testEntities() throws Exception {
     String test = " <foo> = Γ bar Γ";
-    String gold = " < foo> = \u0393 bar \u0393 ";
+    String gold = " <foo> = \u0393 bar \u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -121,7 +121,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testMoreEntities() throws Exception {
     String test = " <junk/> ! @ and ’";
-    String gold = " < junk/> ! @ and ’ ";
+    String gold = " <junk/> ! @ and ’";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -264,7 +264,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testComment() throws Exception {
     String test = "<!--- three dashes, still a valid comment ---> ";
-    String gold = " ";
+    String gold = " ";
     Reader reader = new HTMLStripCharFilter(CharReader.get(new BufferedReader(new StringReader(test))));//force the use of BufferedReader
     int ch = 0;
     StringBuilder builder = new StringBuilder();
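
For anyone who wants to see the offset-correction idea from the first patch in isolation, here is a minimal standalone sketch. It is not the Solr code: the class OffsetCorrectionSketch and its methods are hypothetical names made up for this illustration, it only strips <...> tags (no entities, comments, or invalid HTML), and it only imitates what addOffCorrectMap() and the offset lookup in BaseCharFilter do, as far as I understand them. Instead of padding the output with spaces, it drops the markup and records how far later offsets have to be shifted to point back into the original input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Solr or Lucene. */
final class OffsetCorrectionSketch {
    private final StringBuilder output = new StringBuilder();
    // Each entry is {outputOffset, cumulativeRemoved}: every output offset
    // at or after outputOffset maps to (offset + cumulativeRemoved) in the input.
    private final List<int[]> corrections = new ArrayList<>();

    OffsetCorrectionSketch(String input) {
        int removed = 0; // total number of input chars dropped so far
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int close = (c == '<') ? input.indexOf('>', i) : -1;
            if (close >= 0) {
                // Drop the whole tag and record a correction; emit no padding.
                removed += close - i + 1;
                corrections.add(new int[] {output.length(), removed});
                i = close + 1;
            } else {
                // Ordinary character (or an unclosed '<'): copy it through.
                output.append(c);
                i++;
            }
        }
    }

    /** The text with tags removed and nothing inserted in their place. */
    String stripped() {
        return output.toString();
    }

    /** Maps an offset in the stripped text back to an offset in the input. */
    int correctOffset(int outputOffset) {
        int diff = 0;
        for (int[] entry : corrections) {
            if (entry[0] > outputOffset) break;
            diff = entry[1];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch("G<b>\u00fc</b>nther");
        System.out.println(s.stripped());       // one token, no embedded space
        System.out.println(s.correctOffset(2)); // input position of the 'n'
    }
}
```

With this approach a tokenizer sees one unbroken word, and a highlighter can still map token offsets back to the original markup through correctOffset().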