Hello,

We are seeing a problem where Nutch 1.12 gets stuck on a single document.
We crawl only one site, so effectively only one fetcher thread is active.

The document is https://xxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf, which
is ~18 MB in size.

We have these settings:

http.timeout=60000
http.content.limit=1412929
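
For reference, this is how they look in our conf/nutch-site.xml:

  <property>
    <name>http.timeout</name>
    <value>60000</value>
    <description>The network timeout in milliseconds.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>1412929</value>
    <description>The length limit for downloaded content, in bytes.</description>
  </property>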

When we start a crawl, we see this:

2017-02-01 10:53:56,924 INFO  fetcher.Fetcher - -activeThreads=50, 
spinWaiting=49, fetchQueues.totalSize=207, fetchQueues.getQueueCount=1
.
.
.
Then, 5 minutes later:
2017-02-01 10:58:56,924 WARN  fetcher.Fetcher - Aborting with 50 hung threads.
2017-02-01 10:58:56,924 WARN  fetcher.Fetcher - Thread #0 hung while processing 
https://xxxxxxxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf

It then tries to fetch the very same document again, aborts after another
5 minutes, and so on...

I was able to work around the hang by setting:

mapred.task.timeout=1200000

The fetcher then continued with the next document after ~6.5 minutes.
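
For the archives, the workaround in conf/nutch-site.xml looks like this (on
newer Hadoop versions the property is named mapreduce.task.timeout):

  <!-- Raise the Hadoop task timeout from its default of 600000 ms (10 min).
       If I read Fetcher.java correctly, hung threads are aborted after
       mapred.task.timeout / fetcher.threads.timeout.divisor (default 2),
       which would explain the 5-minute aborts above, and why the
       ~6.5-minute fetch now completes within the new 10-minute window. -->
  <property>
    <name>mapred.task.timeout</name>
    <value>1200000</value>
  </property>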

While debugging I saw that, even though I had set a content limit, Nutch was
still fetching the whole document over HTTP(S), and somehow took longer than
5 minutes to process that fetch.
A wget from the server's command line retrieved the same PDF in ~0.5 seconds.

It would be very useful if Nutch marked such fetch timeouts on the specific
document/URL, continued with the next document/URL, and retried the failed
ones at a later (or random) stage.
With the current behavior, the crawl can get stuck indefinitely...

Any thoughts on this?

André Schild

Aarboard AG <http://www.aarboard.ch/>
Egliweg 10
2560 Nidau
Switzerland
+41 32 332 97 14
