
Re: Nutch fetching times out at 3 hours, not sure why.

2018-05-01 Thread Chip Calhoun

Hi Sebastian,


Yes, that explains it! Now I wish I'd pasted my crawl command in the first 
place. I'll leave it alone for now, but if it becomes an issue again I know 
where to check. Thank you.


Chip



Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Sebastian Nagel

Hi Chip,

got it, you probably run bin/crawl which has the option:
  --time-limit-fetch  Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly.

Best,
Sebastian
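
For anyone hitting the same limit, here is a minimal sketch of raising it from the command line. It assumes a bin/crawl version whose usage text lists --time-limit-fetch (as quoted above) and accepts -s for the seed directory. The directory names, the round count, and the helper function crawl_command are hypothetical. The function only prints the command, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your installation.
SEED_DIR="urls"
CRAWL_DIR="crawl"
NUM_ROUNDS=1

# Print a bin/crawl invocation with the given fetch time limit in minutes.
# Assumption: this bin/crawl accepts --time-limit-fetch and -s <seed_dir>;
# run bin/crawl without arguments to check the usage text of your version.
crawl_command() {
  limit_mins="$1"
  echo "bin/crawl --time-limit-fetch $limit_mins -s $SEED_DIR $CRAWL_DIR $NUM_ROUNDS"
}

# Six hours instead of the default 180 minutes.
crawl_command 360
```

In script versions that do not list this option, the limit may instead be a variable set inside bin/crawl itself; that is an assumption to verify against your own copy of the script.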


RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Chip Calhoun

Hi Sebastian,

Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and 
saved me a lot of time.

I'm still bewildered by the original problem, though. Both my 
fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1. 
I'll ignore it unless it causes a problem for my other cores.

Chip


Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Sebastian Nagel

Hi,

if you still see the log message

   fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then it can be only
 - fetcher.timelimit.mins
 - fetcher.max.exceptions.per.queue

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see
  fetcher.server.delay
or even fetch in parallel from your host, see
  fetcher.threads.per.queue

Best,
Sebastian
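
The two properties named above can be set in conf/nutch-site.xml or passed as -D options. The sketch below takes the second route, assuming a bin/crawl version whose usage text lists -D "key=value". The values and the helper name fetcher_overrides are illustrative assumptions, not recommendations from this thread.

```shell
#!/bin/sh
# Illustrative values only (assumptions, not taken from the thread).
SERVER_DELAY_SECS="1.0"   # fetcher.server.delay: seconds between requests to one host
THREADS_PER_QUEUE=4       # fetcher.threads.per.queue: parallel fetches per host queue

# Print the -D options to add to a bin/crawl command line.
fetcher_overrides() {
  echo "-D fetcher.server.delay=$SERVER_DELAY_SECS -D fetcher.threads.per.queue=$THREADS_PER_QUEUE"
}

fetcher_overrides
```

Settings this aggressive are only appropriate for a server you control, as in the case described here.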


RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Chip Calhoun

I'm still experimenting with this. I had been crawling with a depth of 1 
because I don't need anything outside my URLs list, but I tried with a depth of 
10. It went through a crawl loop that ended after 3 hours, then a second 3 hour 
crawl loop, then a third shorter loop. It still stopped 5 URLs short of 
crawling every URL in my list, though it crawled a few I hadn't included. 

Are these 3 hour loops standard for large crawls?


RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-19 Thread Chip Calhoun

Hi Lewis,

I'm using Nutch 1.2.

Chip


RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-19 Thread Chip Calhoun

Hi Markus,

I don't see an indication of the web server blocking me, though that sounds 
reasonable. Could there be a per-server limit in Nutch itself that we're 
overlooking, since this is all on the same server? 

Chip


Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-18 Thread lewis john mcgibbney

Hi Chip,
Which version of Nutch are you using?


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Markus Jelsma

Hello Chip,

I have no clue where the three hour limit could come from. Please take a 
further look in the last few minutes of the logs.

The only thing I can think of is that a web server would block you after some 
number of requests or some time window; that would be visible in the logs. It is 
clear that Nutch itself terminates the fetcher (the "dropping" line). That is only 
possible with an imposed time limit, or if you reached some number of exceptions 
(or one other variable I am forgetting).

Regards,
Markus
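
One way to look at the last log lines before the fetcher gives up is sketched below. It assumes the log is a plain-text file and that grep supports the -B option (GNU and BSD grep do). The function name lines_before_drop is hypothetical, and logs/hadoop.log is only the usual location in a local install.

```shell
#!/bin/sh
# Print up to N lines of context before each "dropping!" message in a log file.
# Assumption: logs/hadoop.log is where a local Nutch install writes its log;
# pass a different path if yours differs.
lines_before_drop() {
  log_file="$1"
  context="${2:-20}"
  grep -B "$context" 'dropping!' "$log_file"
}

# Example: lines_before_drop logs/hadoop.log 50
```

Exceptions or timeouts logged shortly before the "dropping!" line would point to the exception limit; their absence would point to a time limit.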
 

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Chip Calhoun

I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, 
or even at the same point in a URL's fetcher loop; it really seems to be time 
based. 


RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Sadiki Latty

Which version are you running? That value defaults to -1 in my current 
version (1.14), so it shouldn't be something you needed to change. My 
crawls, by default, run for as long as 12 hours with little to no tweaking 
of the nutch-default. Something else is causing it. Is it always 
the same URL that it fails at?

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org] 
Sent: April-17-18 10:45 AM
To: user@nutch.apache.org
Subject: Nutch fetching times out at 3 hours, not sure why.

I crawl a list of roughly 2600 URLs all on my local server, and I'm only 
crawling around 1000 of them. The fetcher quits after exactly 3 hours (give or 
take a few milliseconds) with this message in the log:

2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue: 
https://history.aip.org >> dropping!

I've seen that 3 hours is the default in some Nutch installations, but I've got 
my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. 
Any thoughts would be greatly appreciated. Thank you.

Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD  20740-3840  USA
Tel: +1 301-209-3180
Email: ccalh...@aip.org
https://www.aip.org/history-programs/niels-bohr-library
