Re: Help posting question

2024-04-25 Thread Sebastian Nagel

Hi Sheham,

the nutch-site.xml configures

  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800</value>
  </property>

1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds
(10 minutes), see [1]. Because Nutch needs to finish fetching before the task
timeout applies, fetcher threads that are not fast enough and are still
running at that point are killed.


I would suggest keeping the property "mapreduce.task.timeout" at its default
value.
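
A minimal sketch of how the configured value could be checked before
fetching. The helper name get_property and the sample file are assumptions
made for illustration; they are not part of Nutch or Hadoop. The helper also
assumes that <name> comes before <value> inside each <property> element.

```shell
#!/bin/sh
# Sketch: read mapreduce.task.timeout from a Hadoop-style config file and
# compare it with the 600000 ms default mentioned above.

# Hypothetical helper: print the <value> of the property named $2 in file $1.
get_property() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # Strip the 7-character <value> and 8-character </value> tags.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Sample config reproducing the setting discussed in this message.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<configuration>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800</value>
  </property>
</configuration>
EOF

default_ms=600000  # default of 600 seconds, expressed in milliseconds
timeout_ms=$(get_property "$sample" mapreduce.task.timeout)

if [ -n "$timeout_ms" ] && [ "$timeout_ms" -lt "$default_ms" ]; then
  echo "mapreduce.task.timeout is ${timeout_ms} ms, below the ${default_ms} ms default"
fi
rm -f "$sample"
```

Against a real installation the helper would be pointed at conf/nutch-site.xml
instead of the sample file. If the property is absent the helper prints
nothing, which means the default applies.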

Best,
Sebastian

[1] 
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml#mapreduce.task.timeout


On 4/24/24 16:38, Lewis John McGibbney wrote:
> Hi Sheham,
>
> On 2024/04/20 08:47:41 Sheham Izat wrote:
>
> > The Fetcher job was aborted, does that still mean that it went through the
> > entire list of seed urls?
>
> Yes it processed the entire generated segment but the fetcher…
>
> * hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,
> https://www.adu.com/ and https://www.lowes.com/
> * was denied by robots.txt for https://sourceforge.net/,
> https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/,
> https://www.linkedin.com/, etc.
> * encountered problems processing some robots.txt files for
> https://twitter.com/, https://www.trustradius.com/
> There may be some other issues encountered by the fetcher.
>
> This is not at all uncommon. The fetcher completed successfully after 7
> seconds. You could progress with your crawl.
>
> > I will go through the mailing list questions.
>
> If you need more assistance please let us know. You will find plenty of
> pointers on this mailing list archive though.
>
> lewismc


Re: Help posting question

2024-04-24 Thread Lewis John McGibbney
Hi Sheham,

On 2024/04/20 08:47:41 Sheham Izat wrote:

> The Fetcher job was aborted, does that still mean that it went through the
> entire list of seed urls?

Yes it processed the entire generated segment but the fetcher…

* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,  
https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/, 
https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, 
https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for 
https://twitter.com/, https://www.trustradius.com/
There may be some other issues encountered by the fetcher.

This is not at all uncommon. The fetcher completed successfully after 7 
seconds. You could progress with your crawl.

> 
> I will go through the mailing list questions.

If you need more assistance please let us know. You will find plenty of 
pointers on this mailing list archive though.

lewismc


Re: Help posting question

2024-04-20 Thread Sheham Izat
Hi Lewis,

The Fetcher job was aborted, does that still mean that it went through the
entire list of seed urls?

I will go through the mailing list questions.

Thank you

On Fri, Apr 19, 2024 at 10:15 PM Lewis John McGibbney 
wrote:

> Hi Sheham,
>
> On 2024/04/19 15:18:01 Sheham Izat wrote:
> >
> > My questions are:
> >
> > 1) What do I need to do to get Nutch to continue working even if there
> > are hung threads?
>
> From what I can see in the log you provided, nothing is preventing Nutch
> from continuing to work. The Fetcher job finished successfully.
>
> > 2) Is there a way to avoid having these hanging threads in the first
> > place?
>
> Several factors can lead to hung fetcher threads. Lots of questions have
> been asked on this mailing list relating to exactly this issue. I would
> encourage you to study some of the community responses and see if they
> assist you in a better understanding of the possible issues. You can filter
> questions in the mailing list search with the following criteria
> * date range: more than 1 days ago
> * body: hung
>
> https://lists.apache.org/list.html?user@nutch.apache.org
>


Re: Help posting question

2024-04-19 Thread Lewis John McGibbney
Hi Sheham,

On 2024/04/19 15:18:01 Sheham Izat wrote:
> 
> My questions are:
> 
> 1) What do I need to do to get Nutch to continue working even if there are
> hung threads?

From what I can see in the log you provided, nothing is preventing Nutch from
continuing to work. The Fetcher job finished successfully.

> 2) Is there a way to avoid having these hanging threads in the first place?

Several factors can lead to hung fetcher threads. Lots of questions have been 
asked on this mailing list relating to exactly this issue. I would encourage 
you to study some of the community responses and see if they assist you in a 
better understanding of the possible issues. You can filter questions in the 
mailing list search with the following criteria
* date range: more than 1 days ago
* body: hung

https://lists.apache.org/list.html?user@nutch.apache.org


Re: Help posting question

2024-04-19 Thread Sheham Izat
Hi Shashanka, All,

Thank you for your reply!

I'm using Nutch 1.19. I did the injection and segment generation using the
following commands:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments

When I run the fetch command, Nutch stops with errors about hung threads.
I've attached the fetch command output and the nutch-site.xml.

s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
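
A small sketch of the segment-selection step, assuming the default
timestamp-style segment names. latest_segment is a hypothetical helper and
the two segment names in the demo are made up; neither is part of Nutch.

```shell
#!/bin/sh
# Sketch: choose the newest segment before calling "bin/nutch fetch".

# Print the newest segment directory under $1, or nothing if none exists.
# Segment names are timestamps, so lexical order matches creation order.
latest_segment() {
  ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Demo on a throwaway directory with two made-up segment names.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20240407224534" "$demo/segments/20240408101500"
newest=$(latest_segment "$demo/segments")
echo "newest segment: ${newest##*/}"
rm -rf "$demo"

# In a real crawl directory the helper could be used like this:
#   s1=$(latest_segment crawl/segments)
#   [ -n "$s1" ] && bin/nutch fetch "$s1"
```

Checking that the variable is non-empty avoids running the fetch step without
a segment argument when the generate step produced no segment.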

My questions are:

1) What do I need to do to get Nutch to continue working even if there are
hung threads?
2) Is there a way to avoid having these hanging threads in the first place?

Thank you
Sheham


On Fri, Apr 19, 2024 at 1:04 AM Shashanka Balakuntala <
shbalakunt...@gmail.com> wrote:

> Hi Shehamizat,
> Please feel free to drop questions on the email itself. One of us/community
> will be glad to help on the same.
>
> *Regards*
>   Shashanka Balakuntala Srinivasa
>
>
>
> On Fri, 19 Apr 2024 at 7:15 AM, Sheham Izat  wrote:
>
> > Hi,
> >
> > I'm trying to get Nutch to work and I have issues, how can I post
> > questions on the group?
> >
> > Thank you,
> > Sheham
> >
>
[root@localhost apache-nutch-1.19]# bin/nutch fetch $s1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2024-04-07 22:46:27,222 INFO o.a.n.p.PluginManifestParser [main] Plugins: 
looking in: /opt/apache-nutch-1.19/plugins
2024-04-07 22:46:27,353 INFO o.a.n.p.PluginRepository [main] Plugin 
Auto-activation mode: [true]
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Registered Plugins:
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main]Regex URL 
Filter (urlfilter-regex)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main]Html Parse 
Plug-in (parse-html)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main]HTTP Framework 
(lib-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]the nutch core 
extension points (nutch-extensionpoints)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Basic Indexing 
Filter (index-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Anchor Indexing 
Filter (index-anchor)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Tika Parser 
Plug-in (parse-tika)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Basic URL 
Normalizer (urlnormalizer-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Regex URL 
Filter Framework (lib-regex-filter)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Regex URL 
Normalizer (urlnormalizer-regex)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]URL Validator 
(urlfilter-validator)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]CyberNeko HTML 
Parser (lib-nekohtml)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]OPIC Scoring 
Plug-in (scoring-opic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Pass-through 
URL Normalizer (urlnormalizer-pass)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]Http Protocol 
Plug-in (protocol-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]SolrIndexWriter 
(indexer-solr)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Registered 
Extension-Points:
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Content 
Parser)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL 
Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (HTML Parse 
Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL 
Normalizer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch 
Publisher)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch 
Exchange)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch 
Protocol)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL 
Ignore Exemption Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Index 
Writer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Segment 
Merge Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch 
Indexing Filter)
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: starting at 
2024-04-07 22:46:27
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: segment: 
crawl/segments/20240407224534
2024-04-07 22:46:28,109 INFO o.a.n.f.FetchItemQueues 

Re: Help posting question

2024-04-18 Thread Shashanka Balakuntala
Hi Shehamizat,
Please feel free to drop questions on the email itself. One of us/community
will be glad to help on the same.

*Regards*
  Shashanka Balakuntala Srinivasa



On Fri, 19 Apr 2024 at 7:15 AM, Sheham Izat  wrote:

> Hi,
>
> I'm trying to get Nutch to work and I have issues, how can I post questions
> on the group?
>
> Thank you,
> Sheham
>


Help posting question

2024-04-18 Thread Sheham Izat
Hi,

I'm trying to get Nutch to work and I have issues, how can I post questions
on the group?

Thank you,
Sheham