I noticed that when I crawl a website using Nutch and index it in Solr, searching for words in the content of the page returns no results, although I do get the title and text from the header. I then took the HTML source and validated it at http://xmlgrid.net/ and found that it is not well formed: only the head part with the title is recognized, and the body is probably not well-formed HTML.
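
To illustrate the check described above, here is a minimal sketch of the difference between a strict XML well-formedness test and a lenient HTML text extraction. It uses only the Python standard library and is not Nutch's own parser; the sample markup and the names SAMPLE, is_well_formed_xml, TextCollector and extract_text are made up for the example, not taken from the actual site.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample page: the body has an unclosed <p> and a bare <br>,
# which is common in real HTML but is not well-formed XML.
SAMPLE = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello<br>world</body></html>"
)


def is_well_formed_xml(markup: str) -> bool:
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient HTML parser that collects every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup: str) -> str:
    """Extract visible text without requiring well-formed markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.chunks)


if __name__ == "__main__":
    print("well-formed XML:", is_well_formed_xml(SAMPLE))
    print("extracted text:", extract_text(SAMPLE))
```

In this sketch the strict check rejects the sample while the lenient pass still recovers the body text, so a page failing an XML validator does not by itself show that an HTML parser cannot read the body.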
As I can't change the website's HTML, how do I get around this? How do I ensure Nutch collects text from the entire page, even if it is not well formed? Is there any other reason I could be missing content on a page, even though the page is fetched and indexed? Any suggestions are welcome.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-not-getting-all-content-of-page-tp4057870.html
Sent from the Nutch - User mailing list archive at Nabble.com.

