Greetings.
I am moving this issue from the solr-user list. As can be seen in the
messages below, I am having problems with the Solr HTML stripper.
After some investigation, I have found the cause to be that the
stripper is replacing the removed HTML with spaces. This obviously
breaks when the HTML is in the middle of a word, like "G&uuml;nther".
So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.
That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those.
Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.
At this point I stopped trying to break it, so there may still be more
problems. Or I might have introduced some problem on my own. Anyway, I
have put the three patches at the bottom of this mail, in case
somebody wants to move along with this issue.
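
For readers who want the idea without reading the patches, here is a minimal sketch of offset correction. It is not the Solr code: the class name, the single hard-coded entity, and the list of {outputOffset, cumulativeDiff} pairs are all assumptions made for illustration, loosely modeled on the addOffCorrectMap() call used in the first patch. The replacement character is emitted without padding, and the number of dropped characters is recorded so that output offsets can be mapped back to the original input.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not the Solr implementation. */
class OffsetCorrectionSketch {
    /** Hypothetical one-entry entity table. */
    private static final String ENTITY = "&uuml;";

    final StringBuilder output = new StringBuilder();
    // Each entry is {outputOffset, cumulativeDiff}: from outputOffset onward,
    // original offset = output offset + cumulativeDiff.
    final List<int[]> corrections = new ArrayList<>();

    OffsetCorrectionSketch(String input) {
        int diff = 0;
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(ENTITY, i)) {
                output.append('\u00fc');          // emit the real character, no padding
                i += ENTITY.length();
                diff += ENTITY.length() - 1;      // characters dropped so far
                corrections.add(new int[] {output.length(), diff});
            } else {
                output.append(input.charAt(i++));
            }
        }
    }

    /** Maps an offset in the filtered output back to the original input. */
    int correctOffset(int outputOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] <= outputOffset) diff = c[1];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch("G&uuml;nther rocks");
        // The first token spans [0, 7) in the output and [0, 12) in the original.
        System.out.println(s.correctOffset(0) + ".." + s.correctOffset(7));
    }
}
```

With this scheme the word stays in one piece in the output, and a highlighter can still find where it starts and ends in the original markup.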
Regards,
Anders.
"Anders Melchiorsen" <m...@spoon.kalibalik.dk> writes:
Hello.
Thanks for the hints. Still some trouble, though.
I added just the HTMLStripCharFilterFactory because, according to
documentation, it should also replace HTML entities. It did, but
still left a space after the entity, so I got two tokens from
"G&uuml;nther". That seems like a bug?
Adding MappingCharFilterFactory in front of the HTML stripper (so
that the latter will not see the entity) does work as expected. That
is, until I try strings like "use &lt;p&gt; to mark a paragraph",
where the HTML stripper will then remove parts of the actual text.
So this approach will not work.
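
The ordering problem can be reproduced outside Solr with a rough sketch. The two methods below are hypothetical stand-ins, not the real char filters: mapEntities() handles only three entities, and stripTags() is a naive regular expression. The sketch assumes the input carries escaped angle brackets such as &lt;p&gt;.

```java
/** Illustrative only; these are naive stand-ins, not the Solr char filters. */
class FilterOrderSketch {
    /** Stand-in for an entity mapping step (handles three entities only). */
    static String mapEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&uuml;", "\u00fc");
    }

    /** Stand-in for tag stripping: drops anything that looks like <...>. */
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String input = "use &lt;p&gt; to mark a paragraph";
        // Mapping first turns the escaped text into a real tag, which is then removed.
        System.out.println(stripTags(mapEntities(input)));
        // Stripping first leaves the escaped text alone; mapping then restores "<p>".
        System.out.println(mapEntities(stripTags(input)));
    }
}
```

Mapping before stripping loses the literal "<p>", while stripping before mapping keeps it.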
Finally, I was happy that I could now use an arbitrary tokenizer
with HTML input. The PatternTokenizer, however, seems to be using
character offsets corresponding to the output of the char filters,
and so the highlighting markers end up at the wrong place. Is that a
bug, or a configuration issue?
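
To make the highlighting symptom concrete, here is a small sketch under stated assumptions. It does not use PatternTokenizer or any Solr class: the tag skipping is naive, and the per-character index list is a hypothetical substitute for whatever offset correction the char filter provides. It only shows what happens when token offsets computed on the filtered text are applied to the original text unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; not how Solr tracks offsets internally. */
class HighlightOffsetSketch {
    final String original;
    final StringBuilder filtered = new StringBuilder();
    // originalIndex.get(i) is the position in 'original' of filtered character i.
    final List<Integer> originalIndex = new ArrayList<>();

    HighlightOffsetSketch(String original) {
        this.original = original;
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                filtered.append(c);
                originalIndex.add(i);
            }
        }
    }

    /** What a highlighter marks if it trusts filtered-text offsets as they are. */
    String uncorrected(int start, int end) {
        return original.substring(start, end);
    }

    /** What it marks once offsets are mapped back to the original text. */
    String corrected(int start, int end) {
        return original.substring(originalIndex.get(start), originalIndex.get(end - 1) + 1);
    }

    public static void main(String[] args) {
        HighlightOffsetSketch s = new HighlightOffsetSketch("<b>Solr</b> rocks");
        Matcher m = Pattern.compile("\\w+").matcher(s.filtered);
        while (m.find()) {
            System.out.println(m.group() + " | " + s.uncorrected(m.start(), m.end())
                    + " | " + s.corrected(m.start(), m.end()));
        }
    }
}
```

The uncorrected column lands inside the markup, which matches the misplaced highlighting markers described above.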
Cheers,
Anders.
Koji Sekiguchi wrote:
Hi Anders,
Sorry, I don't know whether this is a bug or a feature, but
I'd like to show an alternative approach.
In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
marked as deprecated. Instead, using HTMLStripCharFilterFactory with
an arbitrary TokenizerFactory is encouraged.
I'd also recommend using MappingCharFilterFactory
to convert character references to real characters.
That is, you have:
<fieldType name="textHtml" class="solr.TextField" >
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
where the contents of mapping.txt are:
"&uuml;" => "ü"
"&auml;" => "ä"
"&iuml;" => "ï"
"&euml;" => "ë"
"&ouml;" => "ö"
: :
Then run analysis.jsp and see the result.
Thank you,
Koji
Anders Melchiorsen wrote:
Hi.
When indexing the string "G&uuml;nther" with
HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two
tokens, "Gü" and "nther".
Is this a bug, or am I doing something wrong?
(Using a Solr nightly from 2009-05-29)
Anders.
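
The two tokens reported in this first message are what one would expect if the entity is replaced by one character plus padding spaces. The sketch below is a simplified illustration of that effect, not the stripper itself: padEntity() is a hypothetical helper that handles a single entity, and the whitespace split stands in for a whitespace tokenizer.

```java
/** Illustrative only; mimics the padding behaviour, not the Solr code. */
class SpacePaddingSketch {
    /** Replaces "&uuml;" with the real character plus spaces, preserving length. */
    static String padEntity(String input) {
        String entity = "&uuml;";
        StringBuilder padded = new StringBuilder("\u00fc");
        for (int i = 1; i < entity.length(); i++) {
            padded.append(' ');
        }
        return input.replace(entity, padded.toString());
    }

    /** Stand-in for a whitespace tokenizer. */
    static String[] whitespaceTokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String padded = padEntity("G&uuml;nther");
        // The padding keeps offsets aligned but puts spaces inside the word.
        System.out.println(whitespaceTokens(padded).length);
    }
}
```

Length-preserving padding keeps offsets trivially correct, at the cost of splitting any word that contains an entity or inline tag.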
commit 1fb2d42181d8effb1b444aa2fa02d86df1d860d7
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date: Fri Aug 28 15:57:03 2009 +0200
Use offset correction instead of inserting spaces into the stream
Fixes "G&uuml;nther" turning into "Gü nther".
diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index 733d783..e473cef 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -37,7 +37,9 @@ public class HTMLStripCharFilter extends BaseCharFilter {
private int readAheadLimit = DEFAULT_READ_AHEAD;
private int safeReadAheadLimit = readAheadLimit - 3;
private int numWhitespace = 0;
+ private int numWhitespaceCorrected = 0;
private int numRead = 0;
+ private int numReadLast = 0;
private int lastMark;
private Set<String> escapedTags;
@@ -674,9 +676,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
// where do we have to worry about them?
// <![ CDATA [ unescaped markup ]]>
if (numWhitespace > 0){
- numWhitespace--;
- return ' ';
+ addOffCorrectMap(numReadLast+1-numWhitespaceCorrected, numWhitespaceCorrected+numWhitespace);
+ numWhitespaceCorrected += numWhitespace;
+ numWhitespace = 0;
}
+ numReadLast = numRead;
//do not limit this one by the READAHEAD
while(true) {
int lastNumRead = numRead;
commit 542f5734136bbfd72ae802c30b6c61361268bccf
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date: Fri Aug 28 15:57:29 2009 +0200
Use next() in place of read()
The read() method is our public interface, while next()
is what we use internally to get the next character.
diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index e473cef..ab14de5 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -537,13 +537,13 @@ public class HTMLStripCharFilter extends BaseCharFilter {
private int readName(boolean checkEscaped) throws IOException {
StringBuilder builder = (checkEscaped && escapedTags!=null) ? new StringBuilder() : null;
- int ch = read();
+ int ch = next();
if (builder!=null) builder.append((char)ch);
if (!isFirstIdChar(ch)) return MISMATCH;
- ch = read();
+ ch = next();
if (builder!=null) builder.append((char)ch);
while(isIdChar(ch)) {
- ch=read();
+ ch = next();
if (builder!=null) builder.append((char)ch);
}
if (ch!=-1) {
@@ -572,11 +572,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
// <a href="a/<!--#echo "path"-->">
private int readAttr2() throws IOException {
if ((numRead - lastMark < safeReadAheadLimit)) {
- int ch = read();
+ int ch = next();
if (!isFirstIdChar(ch)) return MISMATCH;
- ch = read();
+ ch = next();
while(isIdChar(ch) && ((numRead - lastMark) < safeReadAheadLimit)){
- ch=read();
+ ch = next();
}
if (isSpace(ch)) ch = nextSkipWS();
commit fdaa0920e2dceeb33e534138fe4a672914aff0ea
Author: Anders Melchiorsen <m...@spoon.kalibalik.dk>
Date: Fri Aug 28 16:31:34 2009 +0200
Restore the numRead variable when rolling back the stream
This fixes offset corrections after invalid HTML input, like
"hi &<< <b>there</b>".
diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index ab14de5..4bfa85b 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -171,6 +171,7 @@ public class HTMLStripCharFilter extends BaseCharFilter {
private void restoreState() throws IOException {
input.reset();
pushed.setLength(0);
+ numRead = lastMark;
}
private int readNumericEntity() throws IOException {
commit 571537795af2edb54543db2f71550662b0a18e60
Author: Anders Melchiorsen <a...@gnu.jobsafari.dk>
Date: Fri Aug 28 23:49:03 2009 +0200
Update some tests.
diff --git a/HTMLStripCharFilterTest.java b/HTMLStripCharFilterTest.java
index 7be7c7e..4730830 100644
--- a/HTMLStripCharFilterTest.java
+++ b/HTMLStripCharFilterTest.java
@@ -49,9 +49,9 @@ public class HTMLStripCharFilterTest extends TestCase {
String html = "<div class=\"foo\">this is some text</div> here is a <a href=\"#bar\">link</a> and " +
"another <a href=\"http://lucene.apache.org/\">link</a>. " +
"This is an entity: & plus a <. Here is an &. <!-- is a comment -->";
- String gold = " this is some text here is a link and " +
- "another link . " +
- "This is an entity: & plus a < . Here is an &. ";
+ String gold = " this is some text here is a link and " +
+ "another link . " +
+ "This is an entity: & plus a <. Here is an &. ";
HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new StringReader(html)));
StringBuilder builder = new StringBuilder();
int ch = -1;
@@ -87,7 +87,7 @@ public class HTMLStripCharFilterTest extends TestCase {
public void testGamma() throws Exception {
String test = "&Gamma;";
- String gold = "\u0393 ";
+ String gold = "\u0393";
Set<String> set = new HashSet<String>();
set.add("reserved");
Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -104,7 +104,7 @@ public class HTMLStripCharFilterTest extends TestCase {
public void testEntities() throws Exception {
String test = " <foo> = Γ bar Γ";
- String gold = " < foo> = \u0393 bar \u0393 ";
+ String gold = " <foo> = \u0393 bar \u0393";
Set<String> set = new HashSet<String>();
set.add("reserved");
Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -121,7 +121,7 @@ public class HTMLStripCharFilterTest extends TestCase {
public void testMoreEntities() throws Exception {
String test = " <junk/> ! @ and ’";
- String gold = " < junk/> ! @ and ’ ";
+ String gold = " <junk/> ! @ and ’";
Set<String> set = new HashSet<String>();
set.add("reserved");
Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -264,7 +264,7 @@ public class HTMLStripCharFilterTest extends TestCase {
public void testComment() throws Exception {
String test = "<!--- three dashes, still a valid comment ---> ";
- String gold = " ";
+ String gold = " ";
Reader reader = new HTMLStripCharFilter(CharReader.get(new BufferedReader(new StringReader(test))));//force the use of BufferedReader
int ch = 0;
StringBuilder builder = new StringBuilder();