Hi,

I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search capabilities to an intranet. The content that does get indexed is fine, but most pages don't seem to be parsed completely: the bottom part of the page is missing from the content field. Nutch's logs don't show any exceptions.

I turned off parallel fetching (setting fetcher.threads.per.host and fetcher.threads.fetch both to 1), which seemed to improve things, but I still get some pages that are only partially indexed. http.content.limit and file.content.limit are both set to -1.

I tested different settings for the fetcher.server.delay property, but this seems to affect only how long Nutch waits before the next fetch.

What I think is happening is that the web server can't serve the pages fast enough (when accessing the pages via a browser, it takes about 5 seconds until the page is completely rendered), so Nutch retrieves only the part of the page that has been rendered so far. Is there an option to let Nutch wait a certain amount of time for the page to be completely loaded before parsing the content?
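To check this hypothesis outside of Nutch, a small script along these lines might help. It is only a rough sketch: the function names looks_truncated and fetch are made up for illustration, the URL in the usage comment is a placeholder, and the "ends with </html>" check is a crude heuristic of mine, not anything Nutch does. It fetches the raw HTTP response (no JavaScript is executed) and reports whether the body looks cut off.

```python
import http.client
import urllib.request
from typing import Optional, Tuple


def looks_truncated(body: bytes, content_length: Optional[int]) -> bool:
    """Heuristic (assumption): True if the response body seems cut off."""
    # Fewer bytes than the server announced means the transfer stopped early.
    if content_length is not None and len(body) < content_length:
        return True
    # A complete HTML page normally contains a closing </html> tag.
    return b"</html>" not in body.lower()


def fetch(url: str, timeout: float = 30.0) -> Tuple[bytes, Optional[int]]:
    """Fetch url; return (body, declared Content-Length or None)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        declared = resp.headers.get("Content-Length")
        try:
            body = resp.read()
        except http.client.IncompleteRead as exc:
            # Connection closed before all announced bytes arrived;
            # keep whatever was received so it can be inspected.
            body = exc.partial
    return body, int(declared) if declared is not None else None


# Example usage (placeholder URL, replace with a slow intranet page):
#   body, declared = fetch("http://intranet.example/some/page.html")
#   print(len(body), declared, looks_truncated(body, declared))
```

If the script already receives the whole page, the truncation is more likely to happen at the parsing stage than during the fetch.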

Has anyone already encountered a similar issue? Any pointers are appreciated.
Thanks,
Elisabeth
