Perhaps something for the Tika list?

On Monday 15 November 2010 17:57:13 Markus Jelsma wrote:
> Hi,
> 
> A rather nasty issue just came up, and I traced it back down the line.
> Apparently the parser translates HTML entities back to their original
> characters: &lt; to <, &gt; to >, etc. This is no problem for searching,
> as I strip them away, but the decoded text gets stored, and it is the
> stored fields that are used to display the results.
> 
> bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
> http://www.w3schools.com/tags/ref_entities.asp
> 
> As you can see, the original entities become valid HTML elements and will
> be parsed if displayed as part of search results. The question is why the
> entities get translated and how to turn that off. Searching the internet
> and the config didn't point me in the right direction.
> 
> Doing some escaping on the front end isn't the solution I'm looking for, as
> my highlighting elements would be escaped as well. Escaping there and
> restoring the highlighting elements afterwards is only a temporary
> work-around in this case.
> 
> Cheers,
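
For reference, the temporary front-end work-around described above (escape
the stored text, then restore the highlighting elements) could look roughly
like the sketch below. It is only an illustration under assumptions: the
class and method names are hypothetical, and it assumes the highlighter wraps
hits in <em> and </em>, which may not match the actual setup.

```java
// Hypothetical helper, not part of Nutch or Tika.
class HighlightSafeEscaper {
    // Assumed highlight markers; adjust to whatever the highlighter emits.
    static final String OPEN = "<em>";
    static final String CLOSE = "</em>";

    // Escape the characters that would otherwise be interpreted as markup.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escape everything, then turn the escaped highlight markers back into tags.
    static String escapeKeepingHighlights(String snippet) {
        String escaped = escapeHtml(snippet);
        return escaped
                .replace(escapeHtml(OPEN), OPEN)
                .replace(escapeHtml(CLOSE), CLOSE);
    }
}
```

Note that a page whose own text contains a literal <em> would have it
restored as a tag too, which is one reason this remains a work-around rather
than a fix for the entity decoding itself.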

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
