dockerized Nutch crawl doesn't end

Filip Stysiak Thu, 10 Aug 2017 08:11:48 -0700

Hi everyone;

I am developing an app that runs dockerized Nutch 1.X instances, which
receive crawl requests from Celery and index the results into Solr 6.6.0.
The app indexes images (using the protocol-selenium plugin to fetch dynamic
content). However, I noticed that while small crawl tasks are indexed
properly, I had no success with a slightly bigger one: when I asked my
dockerized app to crawl a website that (after 3 iterations of the crawl
script) needs to fetch ~5000 links, Nutch in the Docker container just
stops working. The last entries I see in hadoop.log are from the fetcher;
there are no exceptions, except for one that does not occur when I run the
very same crawling task (successfully) on the host machine.


The exception (the pastebin link below has the full exception):
org.apache.commons.httpclient.NoHttpResponseException: The server
some.site.web failed to respond
https://pastebin.com/nNs7DP93

I doubt that failing to fetch a couple of links would put Nutch in this
crashed-but-not-really state. I say "not really" because Celery still sees
the task as active, but when I look at htop or *docker stats* it's quite
obvious that Nutch has stopped doing anything productive. Let me restate
that this doesn't occur when I run the task outside of Docker.
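
To illustrate the "still active" part: below is a minimal sketch (not my
actual code) of how a crawl started from a Celery task could be bounded by a
timeout, so that a stalled crawl shows up as a failure instead of staying
active forever. It assumes the task launches the crawl as a subprocess; the
function name run_crawl and the timeout value are hypothetical.

```python
import subprocess
import sys


def run_crawl(cmd, timeout_s):
    """Run a crawl command; return its exit code, or None if it timed out.

    `cmd` is whatever command line starts the crawl (hypothetical here).
    """
    try:
        # subprocess.run kills the direct child process when the timeout
        # expires and then raises TimeoutExpired.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Caveat: only the direct child is killed. If the command is a
        # shell script that starts a JVM, that JVM may outlive it.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a stalled crawl: a child process that sleeps too long.
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_crawl(stalled, timeout_s=1))
```

This would only make the hang visible to Celery; it says nothing about why
the fetcher stalls inside the container.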

Has anyone here stumbled upon anything similar, or does anyone have
experience with running bigger crawls on dockerized Nutch?

Thanks in advance,
Filip
