Nutch- not getting all content of page

kneerosh Mon, 22 Apr 2013 07:19:01 -0700

I noticed that when I crawl a website with Nutch and index it in Solr,
searches for words in the body content of a page return no results,
though I do get matches on the title and text from the header.
I then took the HTML source and validated it at http://xmlgrid.net/ and
found that it is not well formed: only the head part with the title is
recognized, and the body is probably not well-formed HTML.
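
To illustrate what I mean by "not well formed", here is a rough Python
sketch using only the standard library. It is not how Nutch parses pages;
the sample markup and the names SAMPLE, is_well_formed_xml, TextCollector
and extract_text are made up for the example. It shows that a strict XML
check can reject a page while a tolerant HTML parser still reaches the
body text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical page: the head is fine, but the body has unclosed tags.
SAMPLE = (
    "<html><head><title>Demo</title></head>"
    "<body><p>first<br>second<p>third</body></html>"
)


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collect non-empty text nodes; does not require matching close tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    """Feed the markup to the tolerant parser and return the text chunks."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    print("well-formed XML:", is_well_formed_xml(SAMPLE))
    print("text found:", extract_text(SAMPLE))
```

So a failed XML validation on its own may not explain the missing body
text, if the parser used during the crawl is tolerant of broken HTML.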


As I can't change the website's HTML, how do I get around this? How do I ensure
Nutch collects text from the entire page, even if it's not well formed?

Is there any other reason I could be missing content on a page, even though the
page is accessed and indexed? Any suggestions are welcome.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-not-getting-all-content-of-page-tp4057870.html
Sent from the Nutch - User mailing list archive at Nabble.com.
