Hi

Nutch should parse an HTML file with a .txt extension just like a normal HTML file; at least, it does here. What does your parserchecker say? In any case, you must strip any left-over HTML in your Solr analyzer. Left as it is, this is a serious XSS vulnerability.
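
To illustrate the stripping idea outside of Solr, here is a minimal Python sketch using only the standard library. It is not Nutch or Solr code; on the Solr side the usual tool is a char filter such as solr.HTMLStripCharFilterFactory in the field type's analyzer. The names TextExtractor, strip_html and safe_for_page are made up for this example, and the whitespace handling is an assumption about what you want in the index.

```python
from html import escape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering <script> or <style>: ignore everything until it closes.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(raw):
    """Return the text content of raw with markup removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Join with spaces so adjacent block elements do not run together.
    return " ".join(" ".join(parser.parts).split())


def safe_for_page(raw):
    """Strip markup, then escape what is left before it goes into a results page."""
    return escape(strip_html(raw))


if __name__ == "__main__":
    print(safe_for_page("<p>Hello <b>world</b></p><script>alert(1)</script>"))
```

Stripping at index time cleans up the formatting, but escaping at render time is what actually closes the XSS hole, so it is worth doing both.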

Cheers

On Tue, 8 May 2012 08:34:58 -0400, Bai Shen <[email protected]> wrote:
Nutch ended up crawling some HTML files that had a .txt extension. Because
of this (I assume), it didn't strip out the HTML, so now I have weird
formatting on my results page.

Is there a way to fix this on the Nutch side so it doesn't happen again?
