Hi
Nutch should parse an HTML file with a .txt extension just like a normal
HTML file; at least, it does here. What does your parserchecker say? In
any case, you must strip any left-over HTML in your Solr analyzer. Left
as it is, this is a bad XSS vulnerability.
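
To illustrate the kind of stripping and escaping I mean, here is a rough
Python sketch. It is not Nutch or Solr code, and the names strip_html and
safe_snippet are made up for the example. In Solr itself the usual route
is the HTMLStripCharFilterFactory char filter in the field type's
analyzer. Stored values come back from Solr as they were sent, so the
results page should still escape whatever it prints.

```python
from html import escape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, dropping tags and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Tags become a single space so adjacent blocks do not run
        # together; a word with inline markup in the middle gets split.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Return the visible text of `raw` with all markup removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def safe_snippet(raw):
    """Strip markup, then escape the rest for use in an HTML page."""
    return escape(strip_html(raw))
```

Calling safe_snippet() on a field value before it goes into the results
page covers both the weird formatting and the XSS risk on the display
side.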
Cheers
On Tue, 8 May 2012 08:34:58 -0400, Bai Shen <[email protected]> wrote:
Nutch ended up crawling some HTML files that had a .txt extension.
Because of this (I assume), it didn't strip out the HTML, so now I have
weird formatting on my results page.
Is there a way to fix this on the Nutch side so it doesn't happen
again?