Hi,

A rather nasty issue just occurred and I traced it back down the line.
The parser appears to translate HTML entities back to their original
form, &lt; to < and &gt; to > etc. This is no problem for searching, as I strip
it away, but the decoded text gets stored, and it is the stored fields that are
used to display the results.

bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
http://www.w3schools.com/tags/ref_entities.asp

As you can see, the original entities become valid HTML elements and will be
parsed if displayed as part of search results. The question is why the
entities get translated and how to turn it off. Searching the internet and the
config didn't point me in the right direction.

Doing some escaping on the front end isn't the solution I'm looking for, as my
highlighting elements will be escaped as well. Escaping there and restoring
the highlighting elements afterwards is only a temporary work-around in this
case.
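
For reference, a minimal sketch of that work-around in plain Java. It assumes
the highlighter wraps matches in <em> and </em>; the class and method names
are made up for illustration and are not part of Nutch. It also shows why this
is only a stopgap: a literal "<em>" that came from decoded entities in the
page text would be restored as a tag too.

```java
public class HighlightEscaper {

    // Escape the characters that are significant in HTML.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Escape the whole snippet, then turn the escaped highlight tags
    // back into real tags. openTag/closeTag are assumptions; use whatever
    // the highlighter is configured to emit.
    static String escapeKeepingHighlights(String snippet,
                                          String openTag,
                                          String closeTag) {
        String escaped = escapeHtml(snippet);
        return escaped
                .replace(escapeHtml(openTag), openTag)
                .replace(escapeHtml(closeTag), closeTag);
    }

    public static void main(String[] args) {
        // Stored text in which &lt;b&gt; was already decoded to <b>,
        // with highlighting added around the match.
        String snippet = "Use <b> for <em>bold</em> text";
        System.out.println(escapeKeepingHighlights(snippet, "<em>", "</em>"));
    }
}
```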

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
