Perhaps something for the Tika list?
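
In the meantime, the escape-and-restore work-around mentioned in the message below could look roughly like this. It is only a sketch in Python: the function name escape_keep_highlight and the <em> highlighting tags are assumptions, since the message does not say which elements the front end uses.

```python
import html

# Assumed highlighting tags; the original message does not name them.
HIGHLIGHT_TAGS = ("<em>", "</em>")


def escape_keep_highlight(snippet: str, tags=HIGHLIGHT_TAGS) -> str:
    """Escape all HTML in a stored-field snippet, then restore the highlight tags."""
    # Escape everything first, so decoded entities such as < and > become inert.
    escaped = html.escape(snippet, quote=True)
    # Turn the escaped form of each highlight tag back into the real tag.
    for tag in tags:
        escaped = escaped.replace(html.escape(tag, quote=True), tag)
    return escaped


if __name__ == "__main__":
    print(escape_keep_highlight("match <em>here</em> but not <script>x</script>"))
```

A literal <em> in the stored text would be restored as well, which is one reason this stays a temporary work-around rather than a fix.
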
On Monday 15 November 2010 17:57:13 Markus Jelsma wrote:
> Hi,
>
> A quite awful issue just occurred and I traced it back down the line.
> Apparently the parser translates HTML entities back to their
> original form, &lt; to < and &gt; to > etc. This is no problem for
> searching as I strip them away, but the text gets stored, and it is the
> stored fields that are used to display the results.
>
> bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.w3schools.com/tags/ref_entities.asp
>
> As you can see, the original entities become valid HTML elements and will
> be parsed if displayed as part of search results. The question is why the
> entities get translated and how to turn it off. Searching the internet and
> the config didn't point me in the right direction.
>
> Doing some escaping on the front end isn't the solution I'm looking for, as
> my highlighting elements will be escaped as well. Escaping there and
> restoring the highlighting elements afterwards is only a temporary
> work-around in this case.
>
> Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350

