Hi,
I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search
capabilities to an Intranet. The content that is indexed is fine, though
most pages don't seem to be parsed completely: the bottom part of the
page is missing from the content field. Nutch's logs don't show any
exceptions.
I turned off parallel fetching (setting fetcher.threads.per.host and
fetcher.threads.fetch both to 1), which seemed to improve things, but
I still get some pages that are only partially indexed.
http.content.limit and file.content.limit are both set to -1.
I tested different settings for the fetcher.server.delay property, but
this seems to only affect how long Nutch waits until the next fetch.
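
For reference, the settings mentioned above would look roughly like this
in conf/nutch-site.xml. This is a sketch reconstructed from the values
described in this mail, not a verbatim copy of the file, and it leaves
out fetcher.server.delay because several values were tried for it:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Parallel fetching turned off: one thread overall, one per host -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- A negative limit means downloaded content is not truncated -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```
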
What I think is happening is that the web server can't serve the pages
fast enough (when accessing the pages in a browser, it takes about
5 seconds until the page is completely rendered), so Nutch retrieves only
the part of the page that has been rendered so far. Is there an option
to let Nutch wait a certain amount of time for the page to be completely
loaded before parsing the content?
Has anyone already encountered a similar issue? Any pointers appreciated.
Thanks,
Elisabeth