Hi Michael,
> What other post-fetch actions are there?
Well, the fetched content is spilled to disk, which may also become slow in
pathological cases.
But I think it's more important to analyze what happened to the URLs before
they hung. The logs should contain a message "fetching ..." for every hanging
URL. When does that message appear?
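To answer that question from the logs, something like the following might help. It is only a rough sketch: the "hung while processing" pattern is taken from the log excerpt in this thread, while the layout of the "fetching" line (timestamp first, then "fetching <url>" somewhere later on the same line) is an assumption and may differ between Nutch versions. The function names are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Pattern taken from the warnings quoted in this thread.
HUNG = re.compile(r"Thread #(\d+) hung while processing (\S+)")
# Assumed layout: "<date> <time> ... fetching <url> ..." on a single line.
FETCHING = re.compile(r"^(\S+ \S+) .*\bfetching (\S+)")


def fetch_start_times(log_lines):
    """Map each hung URL to the timestamp of its last 'fetching' line, or None."""
    started = {}
    hung = []
    for line in log_lines:
        match = FETCHING.search(line)
        if match:
            started[match.group(2)] = match.group(1)
            continue
        match = HUNG.search(line)
        if match:
            hung.append(match.group(2))
    return {url: started.get(url) for url in hung}


def hung_per_host(hung_urls):
    """Count hung URLs per host to spot hosts that appear more than once."""
    return Counter(urlsplit(url).hostname for url in hung_urls)
```

Feeding it the lines of the fetcher task log (each warning on one line, not wrapped as in the mail) gives the start time per hung URL; a value of None means no matching "fetching" line was found.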
If possible, let us know about:
- Nutch version
- environment (local, distributed)
- configuration, esp. if not the default:
    mapreduce.task.timeout
    fetcher.threads.timeout.divisor
    http.timeout
  and, if in doubt, all other modified fetcher.* properties
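If it is easier, the relevant overrides could be collected with a few lines of Python. This is a sketch only, assuming the overrides live in a Hadoop-style configuration file (usually conf/nutch-site.xml) made of property/name/value elements; the function name and the WATCHED tuple are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The three properties named above; every fetcher.* override is reported too.
WATCHED = (
    "mapreduce.task.timeout",
    "fetcher.threads.timeout.divisor",
    "http.timeout",
)


def reported_properties(xml_text):
    """Return {name: value} for the watched properties and all fetcher.* entries."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in WATCHED or name.startswith("fetcher."):
            found[name] = value
    return found
```

Reading the file into a string and passing it to reported_properties() yields the settings worth pasting into a reply. Properties passed on the command line with -D would not show up here.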
Is the problem reproducible, or does it happen only sometimes?
Thanks,
Sebastian
On 12/09/2016 04:58 PM, Michael Coffey wrote:
> The property fetcher.parse is false and I pass -noParsing to the fetch
> command. What other post-fetch actions are there?
>
>
> From: Sebastian Nagel <[email protected]>
> To: [email protected]
> Sent: Friday, December 9, 2016 12:58 AM
> Subject: Re: Fetcher "hung while processing"
>
> Hi Michael,
>
> what about the property fetcher.parse ?
>
> The queue is unblocked after a page has been fetched but before parsing.
> If the parser is hanging or one of the post-fetch actions takes too long
> it may happen that there are multiple URLs from the same host still in
> process.
>
> Sebastian
>
> On 12/09/2016 02:15 AM, Michael Coffey wrote:
>> I sometimes get a bunch of warning messages that say Thread #x hung while
>> processing <url>
>> Is this just a normal thing to see occasionally, or should I look to find
>> some resolution? I do have an example where the same host shows up on a
>> multitude of these messages, which puzzles me. I think there should be only
>> one thread per host, since I specify fetcher.threads.per.queue=1.
>> Here is an example log showing the first 20 of 50 hung threads. Note that
>> http://shinystat.com and http://fabulous.com show up more than once.
>>
>> 2016-12-09 00:47:29,559 WARN [main] org.apache.nutch.fetcher.Fetcher:
>> Aborting with 50 hung threads.
>> 2016-12-09 00:47:29,560 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #0 hung while processing
>> https://www.hugedomains.com/domain_search.cfm?catSearch=434
>> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #1 hung while processing
>> http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=820
>> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #2 hung while processing http://shinystat.com/it/pro/info_pro.html
>> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #3 hung while processing http://events.stanford.edu/byCategory/13/
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #4 hung while processing
>> https://www.ladesk.com/pricing/hosted/terms-and-conditions/
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #5 hung while processing http://shinystat.com/en/opt-out_free.html
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #6 hung while processing http://shinystat.com/fr/biz/info_biz.html
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #7 hung while processing
>> http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=88
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #8 hung while processing http://www.youronlinechoices.com/sk/slovnik-pojmov
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #9 hung while processing https://twitter.com/sakura_ope
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #10 hung while processing http://europa.eu/european-union/topics/culture_en
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #11 hung while processing http://www.youronlinechoices.com/ee/opt-out-help
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #12 hung while processing
>> https://www.hugedomains.com/domain_search.cfm?catSearch=437
>> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #13 hung while processing
>> http://hosted.ap.org/dynamic/stories/U/US_OBIT_JOHN_GLENN?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #14 hung while processing
>> http://static.fc2.com/sh_css/common/base.css?1200605
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #15 hung while processing https://www.hugedomains.com/terms.cfm
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #16 hung while processing https://www.ladesk.com/comparisons/
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #17 hung while processing http://hu.statcounter.com/features/
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #18 hung while processing http://europa.eu/european-union/about-eu/working_el
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #19 hung while processing http://www.atinternet.com/es/recursos/
>> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread
>> #20 hung while processing http://ietf.org/rfc/rfc2026.txt