Re: HTML documents with TXT extension

Markus Jelsma Tue, 08 May 2012 05:43:09 -0700

Hi

Nutch should parse an HTML file with a .txt extension just like a normal HTML file; at least, it does here. What does your parserchecker say? In any case, you must strip any left-over HTML in your Solr analyzer. Left as it is, this is a serious XSS vulnerability.
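To illustrate what stripping left-over HTML means in practice, here is a minimal sketch using only the Python standard library. It is not Nutch or Solr code; `TextExtractor`, `strip_html` and `safe_for_display` are hypothetical names made up for this example. It assumes you can post-process the text before it reaches the results page.

```python
from html import escape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(raw):
    """Return the visible text of `raw` with markup removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def safe_for_display(raw):
    """Strip markup, then escape what is left so it cannot be read as HTML."""
    return escape(strip_html(raw))
```

Escaping after stripping is deliberate: anything that survives the stripping step is rendered as literal text on the results page rather than interpreted as markup.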


Cheers

On Tue, 8 May 2012 08:34:58 -0400, Bai Shen <[email protected]> wrote:

Nutch ended up crawling some HTML files that had a TXT extension. Because
of this (I assume), it didn't strip out the HTML.  So now I have weird
formatting on my results page.
Is there a way to fix this on the Nutch side so it doesn't happen again?
