Re: Help posting question
Hi Sheham,

the nutch-site.xml configures mapreduce.task.timeout as 1800. A timeout of 1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10 minutes, see [1]. Since Nutch needs to finish fetching before the task timeout applies, threads that are not fetching quickly enough and are still running at the end are killed.

I would suggest keeping the property "mapreduce.task.timeout" at its default value.

Best,
Sebastian

[1] https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml#mapreduce.task.timeout

On 4/24/24 16:38, Lewis John McGibbney wrote:
> Hi Sheham,
>
> On 2024/04/20 08:47:41 Sheham Izat wrote:
>> The Fetcher job was aborted, does that still mean that it went through the
>> entire list of seed urls?
>
> Yes it processed the entire generated segment but the fetcher…
>
> * hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/, https://www.adu.com/ and https://www.lowes.com/
> * was denied by robots.txt for https://sourceforge.net/, https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, https://www.linkedin.com/, etc.
> * encountered problems processing some robots.txt files for https://twitter.com/, https://www.trustradius.com/
>
> There may be some other issues encountered by the fetcher. This is not at all uncommon. The fetcher completed successfully after 7 seconds. You could progress with your crawl.
>
>> I will go through the mailing list questions.
>
> If you need more assistance please let us know. You will find plenty of pointers on this mailing list archive though.
>
> lewismc
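A minimal sketch of how Sebastian's suggestion might look in nutch-site.xml, assuming you prefer to state the default explicitly instead of deleting the override. The value is in milliseconds, so 600000 corresponds to the 600-second default mentioned above. Keeping the property in the file at all is an assumption made for illustration; removing it should have the same effect, since the default then applies.

```xml
<!-- Hypothetical nutch-site.xml fragment: restores the task timeout to the
     Hadoop default (600000 ms = 600 s = 10 min) in place of the 1800 ms override. -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>
```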
Re: Help posting question
Hi Sheham,

On 2024/04/20 08:47:41 Sheham Izat wrote:
> The Fetcher job was aborted, does that still mean that it went through the
> entire list of seed urls?

Yes it processed the entire generated segment but the fetcher…

* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/, https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/, https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for https://twitter.com/, https://www.trustradius.com/

There may be some other issues encountered by the fetcher. This is not at all uncommon. The fetcher completed successfully after 7 seconds. You could progress with your crawl.

> I will go through the mailing list questions.

If you need more assistance please let us know. You will find plenty of pointers on this mailing list archive though.

lewismc
Re: Help posting question
Hi Lewis,

The Fetcher job was aborted, does that still mean that it went through the entire list of seed urls?

I will go through the mailing list questions.

Thank you

On Fri, Apr 19, 2024 at 10:15 PM Lewis John McGibbney wrote:
> Hi Sheham,
>
> On 2024/04/19 15:18:01 Sheham Izat wrote:
> >
> > My questions are:
> >
> > 1) What do I need to do to get Nutch to continue working even if there are
> > hung threads?
>
> From what I can see in the log you provided, nothing is preventing Nutch
> from continuing to work. The Fetcher job finished successfully.
>
> > 2) Is there a way to avoid having these hanging threads in the first place?
>
> Several factors can lead to hung fetcher threads. Lots of questions have
> been asked on this mailing list relating to exactly this issue. I would
> encourage you to study some of the community responses and see if they
> assist you in a better understanding of the possible issues. You can filter
> questions in the mailing list search with the following criteria
> * date range: more than 1 days ago
> * body: hung
>
> https://lists.apache.org/list.html?user@nutch.apache.org
Re: Help posting question
Hi Sheham,

On 2024/04/19 15:18:01 Sheham Izat wrote:
>
> My questions are:
>
> 1) What do I need to do to get Nutch to continue working even if there are
> hung threads?

From what I can see in the log you provided, nothing is preventing Nutch from continuing to work. The Fetcher job finished successfully.

> 2) Is there a way to avoid having these hanging threads in the first place?

Several factors can lead to hung fetcher threads. Lots of questions have been asked on this mailing list relating to exactly this issue. I would encourage you to study some of the community responses and see if they assist you in a better understanding of the possible issues. You can filter questions in the mailing list search with the following criteria
* date range: more than 1 days ago
* body: hung

https://lists.apache.org/list.html?user@nutch.apache.org
Re: Help posting question
Hi Shashanka, All,

Thank you for your reply!

I'm using Nutch 1.19. I did the injection and segment generation using the following commands:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments

When I run the fetch command, Nutch stops with errors about hung threads. I've attached the fetch command output and the nutch-site.xml.

s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1

My questions are:

1) What do I need to do to get Nutch to continue working even if there are hung threads?
2) Is there a way to avoid having these hanging threads in the first place?

Thank you
Sheham

On Fri, Apr 19, 2024 at 1:04 AM Shashanka Balakuntala <shbalakunt...@gmail.com> wrote:
> Hi Shehamizat,
> Please feel free to drop questions on the email itself. One of us/community
> will be glad to help on the same.
>
> *Regards*
> Shashanka Balakuntala Srinivasa
>
> On Fri, 19 Apr 2024 at 7:15 AM, Sheham Izat wrote:
> > Hi,
> >
> > I'm trying to get Nutch to work and I have issues, how can I post questions
> > on the group?
> >
> > Thank you,
> > Sheham

[root@localhost apache-nutch-1.19]# bin/nutch fetch $s1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2024-04-07 22:46:27,222 INFO o.a.n.p.PluginManifestParser [main] Plugins: looking in: /opt/apache-nutch-1.19/plugins
2024-04-07 22:46:27,353 INFO o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true]
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Registered Plugins:
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Regex URL Filter (urlfilter-regex)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Html Parse Plug-in (parse-html)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] HTTP Framework (lib-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] the nutch core extension points (nutch-extensionpoints)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Basic Indexing Filter (index-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Anchor Indexing Filter (index-anchor)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Tika Parser Plug-in (parse-tika)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Regex URL Filter Framework (lib-regex-filter)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] URL Validator (urlfilter-validator)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Http Protocol Plug-in (protocol-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] SolrIndexWriter (indexer-solr)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Registered Extension-Points:
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Content Parser)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (HTML Parse Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL Normalizer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Publisher)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Exchange)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Protocol)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Index Writer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Indexing Filter)
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: starting at 2024-04-07 22:46:27
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: segment: crawl/segments/20240407224534
2024-04-07 22:46:28,109 INFO o.a.n.f.FetchItemQueues
Re: Help posting question
Hi Shehamizat,

Please feel free to drop questions on the email itself. One of us/community will be glad to help on the same.

*Regards*
Shashanka Balakuntala Srinivasa

On Fri, 19 Apr 2024 at 7:15 AM, Sheham Izat wrote:
> Hi,
>
> I'm trying to get Nutch to work and I have issues, how can I post questions
> on the group?
>
> Thank you,
> Sheham
Help posting question
Hi, I'm trying to get Nutch to work and I have issues, how can I post questions on the group? Thank you, Sheham