It is probably not the fetcher but the parser that gets stuck on the document. http.content.limit must be at least 18 MB, or the parser will die trying to parse the truncated file. You might also want to take a look at memory consumption; there is a good chance the JVM gets stuck because of this PDF. Finally, parser.timeout also needs to be high enough, but the right value depends on CPU and available heap space.
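For what it's worth, a minimal sketch of the nutch-site.xml overrides being discussed — the property names are real Nutch configuration keys, but the values are illustrative assumptions for a site serving an ~18 MB PDF, not tested recommendations:

```xml
<!-- nutch-site.xml: illustrative values, assuming an ~18 MB PDF is the
     largest document on the site -->
<property>
  <name>http.content.limit</name>
  <!-- bytes; must exceed the largest document, or set -1 to disable
       truncation entirely. 20971520 = 20 MB, above the 18 MB PDF. -->
  <value>20971520</value>
</property>
<property>
  <name>parser.timeout</name>
  <!-- seconds; raise it so the parser has time to finish the large PDF -->
  <value>120</value>
</property>
<property>
  <name>mapred.task.timeout</name>
  <!-- milliseconds; must outlast fetch + parse of the slowest document
       (1200000 ms = 20 minutes, the value that unstuck the crawl below) -->
  <value>1200000</value>
</property>
```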
Markus

-----Original message-----
> From: André Schild <[email protected]>
> Sent: Wednesday 1st February 2017 11:28
> To: [email protected]
> Subject: Nutch 1.12 get stuck on same document
>
> Hello,
>
> we see a problem where Nutch 1.12 gets stuck on a single document.
> We only crawl one site, and so only one fetcher is active.
>
> The document is https://xxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf,
> which is ~18 MB in size.
>
> We have these settings:
>
> http.timeout=60000
> http.content.limit=1412929
>
> When we start a crawl, we see this:
>
> 2017-02-01 10:53:56,924 INFO fetcher.Fetcher - -activeThreads=50,
> spinWaiting=49, fetchQueues.totalSize=207, fetchQueues.getQueueCount=1
> .
> .
> .
> Then 5 minutes later:
>
> 2017-02-01 10:58:56,924 WARN fetcher.Fetcher - Aborting with 50 hung threads.
> 2017-02-01 10:58:56,924 WARN fetcher.Fetcher - Thread #0 hung while
> processing https://xxxxxxxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf
>
> It then tries to fetch the very same document again, aborts again after
> 5 minutes, and so on...
>
> I could work around the stuck fetcher by setting:
>
> mapred.task.timeout=1200000
>
> The fetcher then continued with the next document after ~6.5 minutes.
>
> While debugging I saw that, even though I had set a content limit, Nutch
> was still fetching the whole document via http(s), but somehow took longer
> than 5 minutes to process that fetch.
> A wget from the server command line retrieved the same PDF in ~0.5 seconds.
>
> I would find it highly interesting if Nutch would mark such fetch timeouts
> on a specific document/URL, continue with the next document/URL, and retry
> the failed ones at a later (or random) stage.
> With the current behavior, the crawl can get stuck indefinitely...
>
> Any thoughts on this?
>
> André Schild
>
> Aarboard AG <http://www.aarboard.ch/>
> Egliweg 10
> 2560 Nidau
> Switzerland
> +41 32 332 97 14 <tel:+41323329714>

