Hello, we are seeing a problem where Nutch 1.12 gets stuck on a single document. We only crawl one site, so only one fetcher is active.
The document is https://xxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf, which is ~18 MB in size. We have these settings:

http.timeout=60000
http.content.limit=1412929

When we start a crawl, we see this:

2017-02-01 10:53:56,924 INFO fetcher.Fetcher - -activeThreads=50, spinWaiting=49, fetchQueues.totalSize=207, fetchQueues.getQueueCount=1
...

Then, 5 minutes later:

2017-02-01 10:58:56,924 WARN fetcher.Fetcher - Aborting with 50 hung threads.
2017-02-01 10:58:56,924 WARN fetcher.Fetcher - Thread #0 hung while processing https://xxxxxxxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf

It then tries to fetch the very same document again, aborts after another 5 minutes, and so on.

I could work around the hang by setting:

mapred.task.timeout=1200000

With that, the fetcher continued to the next document after ~6.5 minutes. While debugging I saw that, even though I had set a content limit, Nutch was still fetching the whole document via http(s), and somehow took longer than 5 minutes to process that fetch. A wget from the server's command line retrieved the same PDF in ~0.5 seconds.

I would find it very useful if Nutch would mark such fetch timeouts on the specific document/URL, continue with the next document/URL, and retry the failed ones at a later (or random) stage. With the current behavior, the crawl can get stuck indefinitely.

Any thoughts on this?

André Schild
Aarboard AG <http://www.aarboard.ch/>
Egliweg 10
2560 Nidau
Switzerland
+41 32 332 97 14
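P.S. For reference, here is roughly how the settings mentioned above look in our conf/nutch-site.xml. This is just a minimal sketch with the values from above; the description texts are only illustrative, and mapred.task.timeout is a Hadoop property that could equally be set in the Hadoop configuration instead:

    <configuration>
      <property>
        <name>http.timeout</name>
        <value>60000</value>
        <description>Network timeout in milliseconds.</description>
      </property>
      <property>
        <name>http.content.limit</name>
        <value>1412929</value>
        <description>Maximum number of bytes to download per document;
        longer content should be truncated.</description>
      </property>
      <property>
        <name>mapred.task.timeout</name>
        <value>1200000</value>
        <description>Hadoop task timeout in milliseconds; raising this to
        20 minutes was the workaround for the hung fetcher threads.</description>
      </property>
    </configuration>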

