Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi Sebastian,

yes, that's what I mean. Do you think there is a way to learn more about
how to crawl any website?

>Hi Ayhan,

>you mean?
>https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt



Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan,

you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt

Sebastian

On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
> 
> as I wrote before, it seems that I am not the only one who cannot crawl all
> of the URLs in seed.txt. I couldn't really find a solution. I collected 450
> domains, and Nutch will not or cannot crawl approximately 200 of them. I want
> to know why this happens. Is there a solution to force crawling these sites?
> 
> It would be great to get a satisfying answer, to know why this happens and 
> maybe how to solve it.
> 
> Thanks in advance
> 
> Ayhan
> 
> 


RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis,

I got a really weird reply back to what I sent, so I thought it better to
resend the URLs. I'm not sure whether you received them the first time.

I've sent them as a text file attachment as well.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK




-Original Message-
From: lewis john mcgibbney  
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs


Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc


Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi,

as I wrote before, it seems that I am not the only one who cannot crawl all
of the URLs in seed.txt. I couldn't really find a solution. I collected 450
domains, and Nutch will not or cannot crawl approximately 200 of them. I want
to know why this happens. Is there a solution to force crawling these sites?

It would be great to get a satisfying answer, to know why this happens and 
maybe how to solve it.

Thanks in advance

Ayhan



Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline,

> 5,36405,0,http://www.notco.com

What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?
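
If it helps, both the seed and the redirect target can be looked up in the
CrawlDb, and the redirect re-checked in one go (a sketch; crawl/crawldb stands
in for your actual CrawlDb path):

  # CrawlDb entries for the seed and for the redirect target
  bin/nutch readdb crawl/crawldb -url http://www.notco.com
  bin/nutch readdb crawl/crawldb -url https://notco.com/

  # fetch + parse check that follows redirects and checks robots.txt
  bin/nutch parsechecker -followRedirects -checkRobotsTxt http://www.notco.com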

~Sebastian


RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis,

Yes, they are public websites. Below are the 20 test URLs I've been trying to
crawl.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org


This is a count of the pages for each of the URLs, crawled and not crawled. As
can be seen, some counts are very large, while some are '0'.


,Project_id,Document Length,url
0,36400,0, http://www.trapview.com/v2/en/
1,36401,0,http://traivefinance.com
2,36402,2344075,http://www.ceibal.edu.uy
3,36403,35072,http://www.talovstudio.com
4,36404,1384658,https://portaltelemedicina.com.br/en/telediagnostic-platform
5,36405,0,http://www.notco.com
6,36406,0,http://www.saiph.org
7,36407,246009,http://www.1doc3.com
8,36408,43190,http://www.amanda-care.com
9,36409,0,http://www.unimadx.com
10,36410,0,http://www.upch.edu.pe/bioinformatic/anemia/app/
11,36411,0,http://www.u-planner.com
12,36412,8084,http://alerce.science
13,36413,0,http://paraempleo.mtess.gov.py
14,36414,0,http://layers.hemav.com
15,36415,0,http://www.sisben.gov.co
16,36416,3794113,http://ialab.com.ar
17,36417,0,http://www.kilimo.com.ar
18,36418,0,https://www.facebook.com/CIRSYS
19,36419,49062,http://www.dymaxionlabs.com
20,36420,1281267,http://cedo.org


Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK




-Original Message-
From: lewis john mcgibbney  
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs


Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc


Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track counters on inject, generate and fetch phases to
understand where records may be being dropped.
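
A minimal way to check those counts between steps (a sketch, assuming the
default layout with the CrawlDb under crawl/crawldb and the log at
$NUTCH_HOME/logs/hadoop.log; adjust the paths to your setup):

  # status breakdown in the CrawlDb (db_unfetched, db_fetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats

  # injector and fetcher messages in the log
  grep 'Total new urls injected' logs/hadoop.log
  grep -c 'fetching' logs/hadoop.log

That usually shows at which step records disappear.
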
Are the seeds you are using public? If so please post your seed file so we
can try.
Thank you
lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi,

(looping back to user@nutch - sorry, pressed the wrong reply button)

> Some URLs were denied by robots.txt,
> while a few failed with: Http code=403

Those are two ways of signalling that these pages shouldn't be crawled;
HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb also contains URLs which failed for various reasons.
That's important in order to avoid 404s, 403s etc. being retried
again and again.
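
If you want to see which URLs those are, you can dump the CrawlDb and grep for
the failed records (a sketch; crawl/crawldb and the dump directory name are
placeholders):

  bin/nutch readdb crawl/crawldb -dump crawldb_dump
  grep -B1 'db_gone' crawldb_dump/part*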

> I also ran some of the URLs that were not crawled through this -
>  bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.
>

The "HTTP 403 Forbidden" could be from a "anti-bot protection" software.
If you run parsechecker at a different time or from a different machine,
and not repeatedly or too often it may succeed.
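
One quick way to see whether the 403 is tied to the crawler's user agent
(a sketch; 'Nutch Crawler' is the http.agent.name from your nutch-site.xml,
and https://myUrl is a placeholder):

  # request as the crawler identifies itself
  curl -s -o /dev/null -w '%{http_code}\n' -A 'Nutch Crawler' https://myUrl

  # the same request with a browser-like user agent
  curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' https://myUrl

If the first returns 403 and the second 200, the site is blocking on the user
agent rather than via robots.txt rules.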

Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you for your reply.
> 
> 1. All URLs were injected, so 20 in total. None was rejected.
> 
> 2. I've had a look at the log files and I can see that some of the URLs could 
> not be fetched because the robots.txt file could not be found. Would this be a 
> reason why the fetch failed? Is there a way to get around it?
> 
> Some URLs were denied by robots.txt, while a few failed with: Http code=403 
> 
> 3. I looked in CrawlDB and most URLs are in there, but were not crawled, so 
> this is something that I find very confusing.
> 
> I also ran some of the URLs that were not crawled through this - bin/nutch 
> parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> Some of the URLs that failed were parsed successfully, so I'm really confused 
> as to why there are no results for them.
> 
> Do you have any suggestions on what I should try?
> 
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
> 
> 
> 
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 12:19
> To: Roseline Antai 
> Subject: Re: Nutch not crawling all URLs
> 
> 
> Hi Roseline,
> 
>> For instance, when I inject 20 URLs, only 9 are fetched.
> 
> Are there any log messages about the 11 unfetched URLs? Try to look for a
> file "hadoop.log" (usually in $NUTCH_HOME/logs/) and look:
>  1. how many URLs have been injected.
>     There should be a log message
>       ... Total new urls injected: ...
>  2. If all 20 URLs are injected, there should be log
>     messages about these URLs from the fetcher:
>       FetcherThread ... fetching ...
>     If the fetch fails, there might be a message about this.
>  3. Look into the CrawlDb for the missing URLs:
>       bin/nutch readdb .../crawldb -url <URL>
>     or
>       bin/nutch readdb .../crawldb -dump ...
>     You get the command-line options by calling
>       bin/nutch readdb
>     without any arguments.
> 
> Alternatively, verify fetching and parsing the URLs by
>   bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> 
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
> 
> Possibly you want to follow redirects anyway? See:
> 
> <property>
>   <name>db.ignore.also.redirects</name>
>   <value>true</value>
>   <description>If true, the fetcher checks redirects the same way as
>   links when ignoring internal or external links. Set to false to
>   follow redirects despite the values for db.ignore.external.links and
>   db.ignore.internal.links.
>   </description>
> </property>
> 
> 
> Best,
> Sebastian

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Greenholtz
I don't know how I joined this mailing list, but please take me off of it; I
have not used Nutch for a long time.

Thanks!



Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi,

I am working with Apache Nutch 1.18 and Solr. I have set up the system 
successfully, but I'm now having the problem that Nutch is refusing to crawl 
all the URLs. I am now at a loss as to what I should do to correct this 
problem. It fetches about half of the URLs in the seed.txt file.

For instance, when I inject 20 URLs, only 9 are fetched. I have made a number 
of changes based on the suggestions I saw on the Nutch forum, as well as on 
Stack overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:

<configuration>

<property>
  <name>http.agent.name</name>
  <value>Nutch Crawler</value>
</property>

<property>
  <name>http.agent.email</name>
  <value>datalake.ng at gmail d</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated
  documents. By default this property is activated due to extremely high levels
  of CPU which parsing can sometimes take.</description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

</configuration>

Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>10</value>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>15</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>70</value>
</property>

I also commented out the line below in the regex-urlfilter file:


# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK
