RE: Nutch not crawling all URLs

2022-02-16 Thread Roseline Antai
479 ERROR tika.TikaParser - Problem 
loading custom Tika configuration from tika-config.xml
java.lang.NumberFormatException: For input string: ""
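For context, this NumberFormatException is typically triggered by an empty numeric value or attribute in tika-config.xml. Shown here only as an illustrative sketch (not the file actually in use), a minimal tika-config.xml that simply delegates to Tika's default parser looks roughly like:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch: register only Tika's default parser chain;
     no numeric settings are present that could be left empty. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
</properties>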

2022-02-15 13:36:25,540 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and ...
2022-02-15 13:36:25,540 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-bibtex-text-file
2022-02-15 13:36:25,542 WARN  parse.ParseSegment - Error parsing: http://www.saiph.org/docs/loco.bibtex: failed(2,0): Can't retrieve Tika parser for mime-type application/...

2022-02-15 13:36:26,374 ERROR tika.TikaParser - Can't retrieve Tika parser for 
mime-type application/javascript
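If the .bibtex and .js resources are not actually needed, one way to silence these errors, sketched here under that assumption, is to exclude the corresponding suffixes in regex-urlfilter.txt ahead of the final accept rule:

# sketch: skip file types for which no Tika parser is configured
-\.(js|bibtex)$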


Regards,
Roseline


The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


-Original Message-
From: Roseline Antai  
Sent: 13 January 2022 17:02
To: user@nutch.apache.org; Sebastian Nagel 
Subject: RE: Nutch not crawling all URLs

Thank you Sebastian.

I will try these.

Kind regards,
Roseline



-Original Message-
From: Sebastian Nagel 
Sent: 13 January 2022 12:33
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system.
Normally, you run the browser in headless mode without a graphical device 
(monitor) attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:
  src/plugin/protocol-selenium/README.md
  
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether 
it works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL


Caveat: because browsers are updated frequently, you may need to use a recent 
driver version and eventually also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.


> My use case is Text mining  and Machine Learning classification. I'm 
> indexing into Solr and then transferring the indexed data to MongoDB 
> for further processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to 
perform all steps. But I'd agree that browser-based crawling isn't that easy to 
set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you. I did enjoy the holiday. Hope you did too. 
> 
> I have had a look at the protocol-selenium plugin, but it was a bit difficult 
> to understand. It appears it only works with Firefox. Does it work at all 
> with Chrome? I was also not sure of what values to set for the properties. It 
> seems you need to have some form of GUI to run it?
> 
> Is there some documentation or tutorial on this? My guess is that some of the 
> pages might not be crawling because of JavaScript. I might be wrong, but 
> would want to test that.
> 
> I think would be quite good for my use case because I am trying to implement 
> broad crawling. 
> 
> My use case is Text mining  and Machine Learning classification. I'm indexing 
> into Solr and then transferring the indexed data to MongoDB for further 
> processing.
> 
> Kind regards,
> Roseline
> 
> 
> 
> 
> 
> -Original Message-
> From: Sebastian Nagel 
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> the mail below went to my junk folder and I didn't see it.
> 
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays but I want to emphasize that Nutch is a community 
> project and in doubt it might take a few days until somebody finds the time 
> to respond.
> 
>> Could you confirm if you received all the urls I sent?
> 
> I've tried a view URLs you sent but not all of them. And to figure out the 
> reason why a site isn't crawled may take some time.
> 
>> Another question I have about Nutch is if it has problems with 
>> crawling javascript pages?
> 
> By default Nutch does not exe

RE: Nutch not crawling all URLs

2022-01-13 Thread Roseline Antai
Thank you Sebastian.

I will try these.

Kind regards,
Roseline



-Original Message-
From: Sebastian Nagel  
Sent: 13 January 2022 12:33
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system.
Normally, you run the browser in headless mode without a graphical device 
(monitor) attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:
  src/plugin/protocol-selenium/README.md
  
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether 
it works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL


Caveat: because browsers are updated frequently, you may need to use a recent 
driver version and eventually also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.


> My use case is Text mining  and Machine Learning classification. I'm 
> indexing into Solr and then transferring the indexed data to MongoDB 
> for further processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to 
perform all steps. But I'd agree that browser-based crawling isn't that easy to 
set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you. I did enjoy the holiday. Hope you did too. 
> 
> I have had a look at the protocol-selenium plugin, but it was a bit difficult 
> to understand. It appears it only works with Firefox. Does it work at all 
> with Chrome? I was also not sure of what values to set for the properties. It 
> seems you need to have some form of GUI to run it?
> 
> Is there some documentation or tutorial on this? My guess is that some of the 
> pages might not be crawling because of JavaScript. I might be wrong, but 
> would want to test that.
> 
> I think would be quite good for my use case because I am trying to implement 
> broad crawling. 
> 
> My use case is Text mining  and Machine Learning classification. I'm indexing 
> into Solr and then transferring the indexed data to MongoDB for further 
> processing.
> 
> Kind regards,
> Roseline
> 
> 
> 
> 
> 
> -Original Message-----
> From: Sebastian Nagel 
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> the mail below went to my junk folder and I didn't see it.
> 
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays but I want to emphasize that Nutch is a community 
> project and in doubt it might take a few days until somebody finds the time 
> to respond.
> 
>> Could you confirm if you received all the urls I sent?
> 
> I've tried a view URLs you sent but not all of them. And to figure out the 
> reason why a site isn't crawled may take some time.
> 
>> Another question I have about Nutch is if it has problems with 
>> crawling javascript pages?
> 
> By default Nutch does not execute Javascript.
> 
> There is a protocol plugin (protocol-selenium) to fetch pages with a web 
> browser between Nutch and the crawled sites. This way Javascript pages can be 
> crawled for the price of some overhead in setting up the crawler and network 
> traffic to fetch the page dependencies (CSS, Javascript, images).
> 
>> I would ideally love to make the crawler work for my URLs than start 
>> checking for other crawlers and waste all the work so far.
> 
> Well, Nutch is for sure a good crawler. But as always: there are many other 
> crawlers which might be better adapted to a specific use case.
> 
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
> 
> Best,
> Sebastian
> 
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - 
>> https://notco.com/

Re: Nutch not crawling all URLs

2022-01-13 Thread Sebastian Nagel
Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system.
Normally, you run the browser in headless mode without a graphical
device (monitor) attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:
  src/plugin/protocol-selenium/README.md
  https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether it
works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL


Caveat: because browsers are updated frequently, you may need to use a recent
driver version and eventually also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.
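A rough way to check that the installed browser and driver versions match (binary names vary by distribution; this is only a sketch):

chromium-browser --version   # or: chromium --version / google-chrome --version
chromedriver --version       # the major versions should line up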


> My use case is Text mining  and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to
perform all steps. But I'd agree that browser-based crawling isn't that easy
to set up with Nutch.
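For reference, one round of that pipeline can be driven step by step roughly as sketched below (paths are placeholders; the bin/crawl script chains the same steps):

bin/nutch inject   crawl/crawldb urls/           # seed the CrawlDb
bin/nutch generate crawl/crawldb crawl/segments  # create a fetch list
s=$(ls -d crawl/segments/* | tail -1)            # newest segment
bin/nutch fetch    "$s"
bin/nutch parse    "$s"
bin/nutch updatedb crawl/crawldb "$s"
# indexing into Solr (bin/nutch index, or the bin/crawl script) would follow here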

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you. I did enjoy the holiday. Hope you did too. 
> 
> I have had a look at the protocol-selenium plugin, but it was a bit difficult 
> to understand. It appears it only works with Firefox. Does it work at all 
> with Chrome? I was also not sure of what values to set for the properties. It 
> seems you need to have some form of GUI to run it?
> 
> Is there some documentation or tutorial on this? My guess is that some of the 
> pages might not be crawling because of JavaScript. I might be wrong, but 
> would want to test that.
> 
> I think would be quite good for my use case because I am trying to implement 
> broad crawling. 
> 
> My use case is Text mining  and Machine Learning classification. I'm indexing 
> into Solr and then transferring the indexed data to MongoDB for further 
> processing.
> 
> Kind regards,
> Roseline
> 
> 
> 
> 
> 
> -Original Message-----
> From: Sebastian Nagel  
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> the mail below went to my junk folder and I didn't see it.
> 
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays but I want to emphasize that Nutch is a community 
> project and in doubt it might take a few days until somebody finds the time 
> to respond.
> 
>> Could you confirm if you received all the urls I sent?
> 
> I've tried a view URLs you sent but not all of them. And to figure out the 
> reason why a site isn't crawled may take some time.
> 
>> Another question I have about Nutch is if it has problems with 
>> crawling javascript pages?
> 
> By default Nutch does not execute Javascript.
> 
> There is a protocol plugin (protocol-selenium) to fetch pages with a web 
> browser between Nutch and the crawled sites. This way Javascript pages can be 
> crawled for the price of some overhead in setting up the crawler and network 
> traffic to fetch the page dependencies (CSS, Javascript, images).
> 
>> I would ideally love to make the crawler work for my URLs than start 
>> checking for other crawlers and waste all the work so far.
> 
> Well, Nutch is for sure a good crawler. But as always: there are many other 
> crawlers which might be better adapted to a specific use case.
> 
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
> 
> Best,
> Sebastian
> 
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - 
>> https://notco.com/
>>   was not indexed, no. When I enabled redirects, I was able to get a few 
>> pages, but they don't seem valid.
>>
>> Could you confirm if you received all the urls I sent

RE: Nutch not crawling all URLs

2022-01-12 Thread Roseline Antai
Hi Sebastian,

Thank you. I did enjoy the holiday. Hope you did too. 

I have had a look at the protocol-selenium plugin, but it was a bit difficult 
to understand. It appears it only works with Firefox. Does it work at all with 
Chrome? I was also not sure of what values to set for the properties. It seems 
you need to have some form of GUI to run it?

Is there some documentation or tutorial on this? My guess is that some of the 
pages might not be crawling because of JavaScript. I might be wrong, but would 
want to test that.

I think it would be quite good for my use case because I am trying to implement 
broad crawling. 

My use case is Text mining  and Machine Learning classification. I'm indexing 
into Solr and then transferring the indexed data to MongoDB for further 
processing.

Kind regards,
Roseline





-Original Message-
From: Sebastian Nagel  
Sent: 12 January 2022 16:12
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays but I want to emphasize that Nutch is a community 
project and in doubt it might take a few days until somebody finds the time to 
respond.

> Could you confirm if you received all the urls I sent?

I've tried a few URLs you sent but not all of them. And to figure out the 
reason why a site isn't crawled may take some time.

> Another question I have about Nutch is if it has problems with 
> crawling javascript pages?

By default Nutch does not execute Javascript.

There is a protocol plugin (protocol-selenium) to fetch pages with a web 
browser between Nutch and the crawled sites. This way Javascript pages can be 
crawled for the price of some overhead in setting up the crawler and network 
traffic to fetch the page dependencies (CSS, Javascript, images).

> I would ideally love to make the crawler work for my URLs than start 
> checking for other crawlers and waste all the work so far.

Well, Nutch is for sure a good crawler. But as always: there are many other 
crawlers which might be better adapted to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - 
> https://notco.com/
>   was not indexed, no. When I enabled redirects, I was able to get a few 
> pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> 
> 
> 
> 
> 
> 
> 
>  http.agent.name
>  Nutch Crawler
> 
> 
> http.agent.email 
> datalake.ng at gmail d  
> db.ignore.internal.links
> false
> 
> 
> db.ignore.external.links
> true
> 
> 
>   plugin.includes
>   
> protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|an
> chor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|langu
> age-identifier
> 
> 
> parser.skip.truncated
> false
> Boolean value for whether we should skip parsing for 
> truncated documents. By default this
> property is activated due to extremely high levels of CPU which 
> parsing can sometimes take.
> 
> 
>  
>db.max.outlinks.per.page
>-1
>The maximum number of outlinks that we'll process for a page.
>If this value is nonnegative (>=0), at most db.max.outlinks.per.page 
> outlinks
>will be processed for a page; otherwise, all outlinks will be processed.
>
>  
> 
>   http.content.limit
>   -1
>   The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   
> 
> 
>   db.ignore.external.links.mode
>   byHost
> 
> 
>   db.injector.overwrite
>   true
> 
> 
>   http.timeout
>   5
>   The default network time

Re: Nutch not crawling all URLs

2022-01-12 Thread Sebastian Nagel
Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays but I want to emphasize that Nutch is
a community project and in doubt it might take a few days
until somebody finds the time to respond.

> Could you confirm if you received all the urls I sent?

I've tried a few URLs you sent but not all of them. And to figure out the
reason why a site isn't crawled may take some time.

> Another question I have about Nutch is if it has problems with crawling
> javascript pages?

By default Nutch does not execute Javascript.

There is a protocol plugin (protocol-selenium) to fetch pages with a web
browser between Nutch and the crawled sites. This way Javascript pages
can be crawled for the price of some overhead in setting up the crawler and
network traffic to fetch the page dependencies (CSS, Javascript, images).
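A rough nutch-site.xml sketch for switching to that plugin, using the selenium.* property names shown with parsechecker elsewhere in this thread (values are placeholders):

<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>selenium.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.grid.binary</name>
  <value>/path/to/selenium/chromedriver</value>
</property>
<property>
  <name>selenium.enable.headless</name>
  <value>true</value>
</property>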

> I would ideally love to make the crawler work for my URLs than start checking
> for other crawlers and waste all the work so far.

Well, Nutch is for sure a good crawler. But as always: there are many
other crawlers which might be better adapted to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - https://notco.com/  was not indexed, no. When I enabled 
> redirects, I was able to get a few pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> 
> 
> 
> 
> 
> 
> 
>  http.agent.name
>  Nutch Crawler
> 
> 
> http.agent.email 
> datalake.ng at gmail d 
> 
> 
> db.ignore.internal.links
> false
> 
> 
> db.ignore.external.links
> true
> 
> 
>   plugin.includes
>   
> protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier
> 
> 
> parser.skip.truncated
> false
> Boolean value for whether we should skip parsing for 
> truncated documents. By default this
> property is activated due to extremely high levels of CPU which 
> parsing can sometimes take.
> 
> 
>  
>db.max.outlinks.per.page
>-1
>The maximum number of outlinks that we'll process for a page.
>If this value is nonnegative (>=0), at most db.max.outlinks.per.page 
> outlinks
>will be processed for a page; otherwise, all outlinks will be processed.
>
>  
> 
>   http.content.limit
>   -1
>   The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   
> 
> 
>   db.ignore.external.links.mode
>   byHost
> 
> 
>   db.injector.overwrite
>   true
> 
> 
>   http.timeout
>   5
>   The default network timeout, in milliseconds.
> 
> 
> 
> Regards,
> Roseline
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 17:35
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> CAUTION: This email originated outside the University. Check before clicking 
> links or attachments.
> 
> Hi Roseline,
> 
>> 5,36405,0,http://www.notco.com
> 
> What is the status for https://notco.com/ which is the final redirect
> target?
> Is the target page indexed?
> 
> ~Sebastian
> 


RE: Nutch not crawling all URLs

2022-01-12 Thread Roseline Antai
Hi Sebastian,

For some reason, the mail below went to my junk folder and I didn't see it.

The notco page - https://notco.com/  was not indexed, no. When I enabled 
redirects, I was able to get a few pages, but they don't seem valid.

Could you confirm if you received all the urls I sent?

Another question I have about Nutch is if it has problems with crawling 
javascript pages?

I would ideally love to make the crawler work for my URLs than start checking 
for other crawlers and waste all the work so far.

Just adding again, this is what my nutch-site.xml looks like:







<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>

<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>

<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>

<property>
 <name>parser.skip.truncated</name>
 <value>false</value>
 <description>Boolean value for whether we should skip parsing for truncated
 documents. By default this property is activated due to extremely high levels
 of CPU which parsing can sometimes take.</description>
</property>

<property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>



Regards,
Roseline

-Original Message-
From: Sebastian Nagel  
Sent: 13 December 2021 17:35
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

CAUTION: This email originated outside the University. Check before clicking 
links or attachments.

Hi Roseline,

> 5,36405,0,http://www.notco.com

What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?

~Sebastian


RE: Nutch not crawling all URLs

2021-12-15 Thread Roseline Antai
Hi,

Following on from my previous enquiry, I was told to send the URLs I was trying 
to crawl to be tried from your end. I sent these, but did not receive any 
confirmation of receipt. Can you please confirm if these have been received, 
and when I can look forward to getting some feedback?

I re-crawled the 20 URLs again and reset these values to the default values 
from the nutch-default.xml file:


<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>70</value>
</property>





I then set the ignore external links to false, as below:



<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>




I set the following property to 'true' still:





<property>
  <name>db.ignore.also.redirects</name>
  <value>true</value>
  <description>If true, the fetcher checks redirects the same way as
  links when ignoring internal or external links. Set to false to
  follow redirects despite the values for db.ignore.external.links and
  db.ignore.internal.links.
  </description>
</property>





13 URLs were fetched, but the URLs that had originally failed returned very few 
pages related to their own domains, which makes me question the crawl.



Also, when external links are not ignored, the crawler does go off onto 
different sites, like Wikipedia, news sites, etc. This is hardly efficient as 
it spends so long on the crawl fetching irrelevant pages. How can this be 
controlled in Nutch? If crawling up to 900 URLs as we are going to be doing, 
will we have to write regex expressions for each URL  in the regex-urlfilter in 
order to stick to the domains in the URL?
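For reference, a sketch of how a crawl is usually kept on the seed domains without writing a regex per URL, using the two properties already shown in this thread (this is illustrative, not a recommendation specific to this setup):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>

Alternatively, if the urlfilter-domain plugin is added to plugin.includes, the allowed domains can be listed one per line in conf/domain-urlfilter.txt instead of maintaining per-URL regular expressions (again, only a sketch of the usual approach).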



There is no explicit documentation on how to do this in Nutch, unless I have 
missed it?



Is there something that should be done that I'm not doing, or is Nutch just 
incapable of efficient crawling?



Regards,

Roseline



Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


From: Roseline Antai
Sent: 13 December 2021 12:02
To: 'user@nutch.apache.org' 
Subject: Nutch not crawling all URLs

Hi,

I am working with Apache nutch 1.18 and Solr. I have set up the system 
successfully, but I'm now having the problem that Nutch is refusing to crawl 
all the URLs. I am now at a loss as to what I should do to correct this 
problem. It fetches about half of the URLs in the seed.txt file.

For instance, when I inject 20 URLs, only 9 are fetched. I have made a number 
of changes based on the suggestions I saw on the Nutch forum, as well as on 
Stack overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:









<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>

<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>

<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>

<property>
 <name>parser.skip.truncated</name>
 <value>false</value>
 <description>Boolean value for whether we should skip parsing for truncated
 documents. By default this property is activated due to extremely high levels
 of CPU which parsing can sometimes take.</description>
</property>

<property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>



Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>10</value>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>15</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>70</value>
</property>



I also commented out the line below in the regex-urlfilter file:


# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi Sebastian,

Yes, that is what I mean. Do you think there is a way to learn more about
how to crawl any website?

>Hi Ayhan,

>you mean?
>https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt



Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan,

you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt

Sebastian

On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
> 
> as I wrote before, it seems that I am not the only one who can not crawl all 
> the seed.txt url's. I couldn't
> find a solution really. I collected 450 domains and approximately 200 nutch 
> will or can not crawl. I want to
> know why this happens, is there a solution to force crawling sites?
> 
> It would be great to get a satisfying answer, to know why this happens and 
> maybe how to solve it.
> 
> Thanks in advance
> 
> Ayhan
> 
> 


RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis,

I got a really weird reply back from what I sent, so I thought it better to 
resend the URLs again. I'm unsure if you got the URLs in the first instance.

I've sent them as a text file attachment as well.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK


The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


-Original Message-
From: lewis john mcgibbney  
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

CAUTION: This email originated outside the University. Check before clicking 
links or attachments.

Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
> Administrivia:
>
> -
> To post to the list, e-mail: user@nutch.apache.org To unsubscribe, 
> e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> --
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system 
> successfully, but I'm now having the problem that Nutch is refusing to 
> crawl all the URLs. I am now at a loss as to what I should do to 
> correct this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
> number of changes based on the suggestions I saw on the Nutch forum, 
> as well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> **
>
> **
>
>
>
> **
>
>
>
> **
>
> **
>
> *http.agent.name*
>
> *Nutch Crawler*
>
> **
>
> **
>
> *http.agent.email *
>
> *datalake.ng at gmail d *
>
> **
>
> **
>
> *db.ignore.internal.links*
>
> *false*
>
> **
>
> **
>
> *db.ignore.external.links*
>
> *true*
>
> **
>
> **
>
> *  plugin.includes*
>
> *
> protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|an
> chor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|langu
> age-identifier*
>
> **
>
> **
>
> *parser.skip.truncated*
>
> *false*
>
> *Boolean value for whether we should skip parsing for
> truncated documents. By default this*
>
> *property is activated due to extremely high levels of CPU which
> parsing can sometimes take.*
>
> **
>
> **
>
> **
>
> *   db.max.outlinks.per.page*

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi,

as I wrote before, it seems that I am not the only one who cannot crawl all 
the URLs in seed.txt. I couldn't really find a solution. I collected 450 domains, 
and Nutch will not or cannot crawl approximately 200 of them. I want to
know why this happens; is there a solution to force crawling of these sites?

It would be great to get a satisfying answer, to know why this happens and 
maybe how to solve it.

Thanks in advance

Ayhan



Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline,

> 5,36405,0,http://www.notco.com

What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?
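A quick way to inspect that redirect chain outside Nutch (just a sketch):

curl -sIL http://www.notco.com | grep -iE '^(HTTP|location)'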

~Sebastian


RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis,

Yes, there are public websites. Below are the 20 test URLs I've been trying to 
crawl.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org


This is a count of the pages for the URLs crawled and not crawled. As can be 
seen, some are very large, while some are '0'.


,Project_id,Document Length,url
0,36400,0, http://www.trapview.com/v2/en/
1,36401,0,http://traivefinance.com
2,36402,2344075,http://www.ceibal.edu.uy
3,36403,35072,http://www.talovstudio.com
4,36404,1384658,https://portaltelemedicina.com.br/en/telediagnostic-platform
5,36405,0,http://www.notco.com
6,36406,0,http://www.saiph.org
7,36407,246009,http://www.1doc3.com
8,36408,43190,http://www.amanda-care.com
9,36409,0,http://www.unimadx.com
10,36410,0,http://www.upch.edu.pe/bioinformatic/anemia/app/
11,36411,0,http://www.u-planner.com
12,36412,8084,http://alerce.science
13,36413,0,http://paraempleo.mtess.gov.py
14,36414,0,http://layers.hemav.com
15,36415,0,http://www.sisben.gov.co
16,36416,3794113,http://ialab.com.ar
17,36417,0,http://www.kilimo.com.ar
18,36418,0,https://www.facebook.com/CIRSYS
19,36419,49062,http://www.dymaxionlabs.com
20,36420,1281267,http://cedo.org


Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK


The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


-Original Message-
From: lewis john mcgibbney  
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

CAUTION: This email originated outside the University. Check before clicking 
links or attachments.

Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
> Administrivia:
>
> -
> To post to the list, e-mail: user@nutch.apache.org To unsubscribe, 
> e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> --
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system 
> successfully, but I'm now having the problem that Nutch is refusing to 
> crawl all the URLs. I am now at a loss as to what I should do to 
> correct this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
> number of changes based on the suggestions I saw on the Nutch forum, 
> as well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> **
>
> **
>
>
>
> **
>
>
>
> **
>
> **
>
> *http.agent.name*
>
> *Nutch Crawler*
>
> **
>
> **
>
> *http.agent.email *
>
> *datalake.ng at gmail d *
>
> **
>
> **
>
> *db.ignore.internal.links*
>
> *false*
>
> **
>
> **
>
> *db.ignore.external.links*
>
> *true*
>
> **
>
> **
>
> *  plugin.includes*
>
> *
> protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|an
>

Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track counters on inject, generate and fetch phases to
understand where records may be being dropped.
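A rough way to pull those counters out of a local 1.x install, assuming the default crawl/ layout and logs/hadoop.log, is sketched here:

bin/nutch readdb crawl/crawldb -stats                 # URL counts per CrawlDb status
grep 'urls injected' logs/hadoop.log                  # injector totals
grep 'FetcherThread' logs/hadoop.log | grep fetching  # per-URL fetch attempts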
Are the seeds you are using public? If so please post your seed file so we
can try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
> Administrivia:
>
> -
> To post to the list, e-mail: user@nutch.apache.org
> To unsubscribe, e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> --
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> **
>
> **
>
>
>
> **
>
>
>
> **
>
> **
>
> *http.agent.name <http://http.agent.name>*
>
> *Nutch Crawler*
>
> **
>
> **
>
> *http.agent.email *
>
> *datalake.ng at gmail d *
>
> **
>
> **
>
> *db.ignore.internal.links*
>
> *false*
>
> **
>
> **
>
> *db.ignore.external.links*
>
> *true*
>
> **
>
> **
>
> *  plugin.includes*
>
> *
> protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier*
>
> **
>
> **
>
> *parser.skip.truncated*
>
> *false*
>
> *Boolean value for whether we should skip parsing for
> truncated documents. By default this*
>
> *property is activated due to extremely high levels of CPU which
> parsing can sometimes take.*
>
> **
>
> **
>
> **
>
> *   db.max.outlinks.per.page
> <http://db.max.outlinks.per.page>*
>
> *   -1*
>
> *   The maximum number of outlinks that we'll process for a
> page.*
>
> *   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> <http://db.max.outlinks.per.page> outlinks*
>
> *   will be processed for a page; otherwise, all outlinks will be
> processed.*
>
> *   *
>
> **
>
> **
>
> *  http.content.limit*
>
> *  -1*
>
> *  The length limit for downloaded content using the http://*
>
> *  protocol, in bytes. If this value is nonnegative (>=0), content longer*
>
> *  than it will be truncated; otherwise, no truncation at all. Do not*
>
> *  confuse this setting with the file.content.limit setting.*
>
> *  *
>
> **
>
> **
>
> *  db.ignore.external.links.mode*
>
> *  byDomain*
>
> **
>
> **
>
> *  db.injector.overwrite*
>
> *  true*
>
> **
>
> **
>
> *  http.timeout*
>
> *  5*
>
> *  The default network timeout, in
> milliseconds.*
>
> **
>
> **
>
>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
>
>
> *property>*
>
> *  http.redirect.max*
>
> *  2*
>
> *  The maximum number of redirects the fetcher will follow
> when*
>
> *  trying to fetch a page. If set to negative or 0, fetcher won't
> immediately*
>
> *  follow redirected URLs, instead it will record them for later fetching.*
>
> *  *
>
> **
>
> 
>
>
>
> **
>
> *  ftp.timeout*
>
> *  10*
>
> **
>
>
>
> **
>
> *  ftp.server.timeout*
>
> *  15*
>
> **
>
>
>
> ***
>
>
>
> *property>*
>
> *  fetcher.server.delay*
>
> *  65.0*
>
> **
>
>
>
> **
>
> *  fetcher.server.min.delay*
>
> *  25.0*
>
> **
>
>
>
> **
>
> * fetcher.max.crawl.delay*
>
> * 70*
>
> * *
>
>
>
> I also commented out the line below in the regex-urlfilter file:
>
>
>
> *# skip URLs containing certain characters as probable queries, etc.*
>
> *-[?*!@=]*
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I’m not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> *Dr Roseline Antai*
>
> *Research Fellow*
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
>
>
>
>
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi,

(looping back to user@nutch - sorry, pressed the wrong reply button)

> Some URLs were denied by robots.txt,
> while a few failed with: Http code=403

Those are two ways of signaling that these pages shouldn't be crawled;
HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb contains also URLs which failed for various reasons.
That's important in order to avoid that 404s, 403s etc. are retried
again and again.
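To see which status a particular URL ended up with, something along these lines works (paths and the URL are placeholders, and the dump file name may differ):

bin/nutch readdb crawl/crawldb -url http://www.notco.com/   # status, fetch time, retries for one URL
bin/nutch readdb crawl/crawldb -dump crawldb-dump           # full dump for offline inspection
grep -A6 'notco' crawldb-dump/part-r-00000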

> I also ran some of the URLs that were not crawled through this -
>  bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.
>

The "HTTP 403 Forbidden" could be from a "anti-bot protection" software.
If you run parsechecker at a different time or from a different machine,
and not repeatedly or too often it may succeed.

Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you for your reply.
> 
> 1. All URLs were injected, so 20 in total. None was rejected.
> 
> 2. I've had a look at the log files and I can see that some of the URLs could 
> not be fetched because the robot.txt file could not be found. Would this be a 
> reason for why the fetch failed? Is there a way to go around it?
> 
> Some URLs were denied by robots.txt, while a few failed with: Http code=403 
> 
> 3. I looked in CrawlDB and most URLs are in there, but were not crawled, so 
> this is something that I find very confusing.
> 
> I also ran some of the URLs that were not crawled through this - bin/nutch 
> parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> Some of the URLs that failed were parsed successfully, so I'm really confused 
> as to why there are no results for them.
> 
> Do you have any suggestions on what I should try?
> 
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
> 
> 
> The University of Strathclyde is a charitable body, registered in Scotland, 
> number SC015263.
> 
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 12:19
> To: Roseline Antai 
> Subject: Re: Nutch not crawling all URLs
> 
> CAUTION: This email originated outside the University. Check before clicking 
> links or attachments.
> 
> Hi Roseline,
> 
>> For instance, when I inject 20 URLs, only 9 are fetched.
> 
> Are there any log messages about the 11 unfetched URLs in the log files.  Try 
> to look for a file "hadoop.log"
> (usually in $NUTCH_HOME/logs/) and look
>  1. how many URLs have been injected.
> There should be a log message
>  ... Total new urls injected: ...
>  2. If all 20 URLs are injected, there should be log
> messages about these URLs from the fetcher:
>  FetcherThread ... fetching ...
> If the fetch fails, there might be a message about
> this.
>  3. Look into the CrawlDb for the missing URLs.
>   bin/nutch readdb .../crawldb -url 
> or
>   bin/nutch readdb .../crawldb -dump ...
> You get the command-line options by calling
>   bin/nutch readdb
> without any arguments
> 
> Alternatively, verify fetching and parsing the URLs by
>   bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> 
>> <property>
>> <name>db.ignore.external.links</name>
>> <value>true</value>
>> </property>
> 
> Eventually, you want to follow redirects anyway? See
> 
> <property>
>   <name>db.ignore.also.redirects</name>
>   <value>true</value>
>   <description>If true, the fetcher checks redirects the same way as
>   links when ignoring internal or external links. Set to false to
>   follow redirects despite the values for db.ignore.external.links and
>   db.ignore.internal.links.
>   </description>
> </property>
> 
> 
> Best,
> Sebastian
> 
> 
> On 12/13/21 13:02, Roseline Antai wrote:
>> Hi,
>>
>>
>>
>> I am working with Apache nutch 1.18 and Solr. I have set up the system 
>> successfully, but I’m now having the problem that Nutch is refusing to 
>> crawl all the URLs. I am now at a loss as to what I should do to 
>> correct this problem. It fetches about half of the URLs in the seed.txt file.
>>
>>
>>
>> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
>> number of changes based on the suggestions I saw on the Nutch forum, 
>> as well as on Stack overflow, but nothing seems to work.
>>
>>
>>
>> This is what my nutch-site.xml file looks like:
>>
>>
>>
>>
>>
>>

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Greenholtz
I don't know how I joined this mailing list but please take me off of this
list, I have not used Nutch for a long time.

Thanks!

On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai 
wrote:

> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> **
>
> **
>
>
>
> **
>
>
>
> **
>
> **
>
> *http.agent.name *
>
> *Nutch Crawler*
>
> **
>
> **
>
> *http.agent.email *
>
> *datalake.ng at gmail d *
>
> **
>
> **
>
> *db.ignore.internal.links*
>
> *false*
>
> **
>
> **
>
> *db.ignore.external.links*
>
> *true*
>
> **
>
> **
>
> *  plugin.includes*
>
> *
> protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier*
>
> **
>
> **
>
> *parser.skip.truncated*
>
> *false*
>
> *Boolean value for whether we should skip parsing for
> truncated documents. By default this*
>
> *property is activated due to extremely high levels of CPU which
> parsing can sometimes take.*
>
> **
>
> **
>
> **
>
> *   db.max.outlinks.per.page
> *
>
> *   -1*
>
> *   The maximum number of outlinks that we'll process for a
> page.*
>
> *   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>  outlinks*
>
> *   will be processed for a page; otherwise, all outlinks will be
> processed.*
>
> *   *
>
> **
>
> **
>
> *  http.content.limit*
>
> *  -1*
>
> *  The length limit for downloaded content using the http://*
>
> *  protocol, in bytes. If this value is nonnegative (>=0), content longer*
>
> *  than it will be truncated; otherwise, no truncation at all. Do not*
>
> *  confuse this setting with the file.content.limit setting.*
>
> *  *
>
> **
>
> **
>
> *  db.ignore.external.links.mode*
>
> *  byDomain*
>
> **
>
> **
>
> *  db.injector.overwrite*
>
> *  true*
>
> **
>
> **
>
> *  http.timeout*
>
> *  5*
>
> *  The default network timeout, in
> milliseconds.*
>
> **
>
> **
>
>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
>
>
> *property>*
>
> *  http.redirect.max*
>
> *  2*
>
> *  The maximum number of redirects the fetcher will follow
> when*
>
> *  trying to fetch a page. If set to negative or 0, fetcher won't
> immediately*
>
> *  follow redirected URLs, instead it will record them for later fetching.*
>
> *  *
>
> **
>
> 
>
>
>
> **
>
> *  ftp.timeout*
>
> *  10*
>
> **
>
>
>
> **
>
> *  ftp.server.timeout*
>
> *  15*
>
> **
>
>
>
> ***
>
>
>
> *property>*
>
> *  fetcher.server.delay*
>
> *  65.0*
>
> **
>
>
>
> **
>
> *  fetcher.server.min.delay*
>
> *  25.0*
>
> **
>
>
>
> **
>
> * fetcher.max.crawl.delay*
>
> * 70*
>
> * *
>
>
>
> I also commented out the line below in the regex-urlfilter file:
>
>
>
> *# skip URLs containing certain characters as probable queries, etc.*
>
> *-[?*!@=]*
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I’m not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> *Dr Roseline Antai*
>
> *Research Fellow*
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
>
>
>
>
>


Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi,

I am working with Apache nutch 1.18 and Solr. I have set up the system 
successfully, but I'm now having the problem that Nutch is refusing to crawl 
all the URLs. I am now at a loss as to what I should do to correct this 
problem. It fetches about half of the URLs in the seed.txt file.

For instance, when I inject 20 URLs, only 9 are fetched. I have made a number 
of changes based on the suggestions I saw on the Nutch forum, as well as on 
Stack overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:









<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>

<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>

<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>

<property>
 <name>parser.skip.truncated</name>
 <value>false</value>
 <description>Boolean value for whether we should skip parsing for truncated
 documents. By default this property is activated due to extremely high levels
 of CPU which parsing can sometimes take.</description>
</property>

<property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>



Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>10</value>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>15</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>70</value>
</property>



I also commented out the line below in the regex-urlfilter file:


# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]
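For reference, regex-urlfilter.txt can also serve as an explicit allow-list; the first matching rule wins, so a sketch along these lines (the domains shown are only examples taken from the seed list) keeps the crawl on chosen domains and rejects everything else:

# accept the wanted domains
+^https?://([a-z0-9-]+\.)*ceibal\.edu\.uy/
+^https?://([a-z0-9-]+\.)*saiph\.org/
# reject anything that did not match an accept rule above
-.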

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.