Partially parsed pages

Elisabeth Adler Fri, 23 Mar 2012 05:40:19 -0700

Hi,

I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search capabilities to an Intranet. The content that is indexed is fine, though most of the pages don't seem to be parsed completely: the bottom part of the page is missing in the content field. Nutch's logs don't show any exceptions.

I turned off parallel fetching (setting fetcher.threads.per.host and fetcher.threads.fetch both to 1), which seemed to improve things, but I still get some pages which are only partially indexed. http.content.limit and file.content.limit are both set to -1.
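
For reference, the settings described above would look roughly like this as overrides in conf/nutch-site.xml. This is only a sketch: the file location and layout are assumed from a standard Nutch 1.x installation, not copied from the actual setup, and only the four properties named above are shown.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Fetch with a single thread overall and a single thread per host,
       i.e. no parallel fetching. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- A value of -1 disables truncation of downloaded content. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```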

I tested different settings for the fetcher.server.delay property, but this seems to only affect how long Nutch waits until the next fetch.

What I think is happening is that the web server can't serve the pages fast enough (when accessing the pages via a browser it takes about 5 seconds until the page is rendered completely), so Nutch retrieves only the part of the page that has been rendered so far. Is there an option to let Nutch wait a certain amount of time for the page to be completely loaded before parsing the content?


Has anyone already encountered a similar issue? Any pointers appreciated.
Thanks,
Elisabeth
