Re: Nutch fetching times out at 3 hours, not sure why.
Hi Sebastian,

Yes, that explains it! Now I wish I'd pasted my crawl command in the first place. I'll leave it alone for now, but if it becomes an issue again I know where to check. Thank you.

Chip

From: Sebastian Nagel <wastl.na...@googlemail.com>
Sent: Monday, April 30, 2018 4:53:20 PM
To: user@nutch.apache.org
Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip,

got it, you probably run bin/crawl which has the option:

  --time-limit-fetch    Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly.

Best,
Sebastian
Re: Nutch fetching times out at 3 hours, not sure why.
Hi Chip,

got it, you probably run bin/crawl which has the option:

  --time-limit-fetch    Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly.

Best,
Sebastian

On 04/30/2018 09:04 PM, Chip Calhoun wrote:
> Hi Sebastian,
>
> Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and
> saved me a lot of time.
>
> I'm still bewildered by the original problem, though. Both my
> fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1.
> I'll ignore it unless it causes a problem for my other cores.
>
> Chip
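To make the explanation in this message concrete, here is a minimal dry-run sketch. It only prints a command instead of running it. It assumes a bin/crawl version that accepts the --time-limit-fetch option quoted above and a -s option for the seed directory; the directories urls/ and crawl/, the round count, and the value 360 are placeholders, not values from this thread. Option names and argument order differ between Nutch releases, so check the usage output of your own bin/crawl first.

```shell
# Dry run: build and print a bin/crawl invocation with a longer fetch time limit.
# Every value below is a placeholder (assumption), not taken from the thread.
TIME_LIMIT_FETCH=360   # minutes; the script default quoted above is 180 (3 hours)
SEED_DIR="urls/"       # hypothetical seed directory
CRAWL_DIR="crawl/"     # hypothetical crawl directory
NUM_ROUNDS=1

CRAWL_CMD="bin/crawl --time-limit-fetch $TIME_LIMIT_FETCH -s $SEED_DIR $CRAWL_DIR $NUM_ROUNDS"

# Print instead of executing; run the printed command from the Nutch directory.
echo "$CRAWL_CMD"
```

As far as I know, the script hands this limit to the fetch job as a command-line override of fetcher.timelimit.mins, which would explain why a value of -1 in the site configuration had no visible effect here.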
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Sebastian,

Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and saved me a lot of time.

I'm still bewildered by the original problem, though. Both my fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1. I'll ignore it unless it causes a problem for my other cores.

Chip

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Monday, April 30, 2018 12:21 PM
To: user@nutch.apache.org
Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi,

if you still see the log message

  fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then it can be only
 - fetcher.timelimit.mins
 - fetcher.max.exceptions.per.queue

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see
  fetcher.server.delay
or even fetch in parallel from your host, see
  fetcher.threads.per.queue

Best,
Sebastian
Re: Nutch fetching times out at 3 hours, not sure why.
Hi,

if you still see the log message

  fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then it can be only
 - fetcher.timelimit.mins
 - fetcher.max.exceptions.per.queue

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see
  fetcher.server.delay
or even fetch in parallel from your host, see
  fetcher.threads.per.queue

Best,
Sebastian

On 04/30/2018 04:44 PM, Chip Calhoun wrote:
> I'm still experimenting with this. I had been crawling with a depth of 1
> because I don't need anything outside my URLs list, but I tried with a depth
> of 10. It went through a crawl loop that ended after 3 hours, then a second 3
> hour crawl loop, then a third shorter loop. It still stopped 5 URLs short of
> crawling every URL in my list, though it crawled a few I hadn't included.
>
> Are these 3 hour loops standard for large crawls?
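The two tuning properties named in this message can also be passed per run. The sketch below is a dry run that only prints the command. It assumes bin/crawl forwards -D key=value pairs to the Nutch jobs and accepts -s for the seed directory; the values 4 and 1.0 and the paths are illustrative placeholders, not settings anyone in this thread reported using. The same properties can be set in conf/nutch-site.xml instead.

```shell
# Dry run: pass per-host fetch tuning as -D overrides.
# Property names come from the thread; the values are assumptions.
THREADS_PER_QUEUE=4    # parallel fetches against the same host
SERVER_DELAY=1.0       # seconds between requests to the same host

TUNING_OPTS="-D fetcher.threads.per.queue=$THREADS_PER_QUEUE -D fetcher.server.delay=$SERVER_DELAY"
TUNED_CMD="bin/crawl $TUNING_OPTS -s urls/ crawl/ 1"

# Print instead of executing; only use aggressive values on a server you control.
echo "$TUNED_CMD"
```

Raising these is reasonable here only because the crawl targets the poster's own server; against third-party hosts the polite defaults are the safer choice.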
RE: Nutch fetching times out at 3 hours, not sure why.
I'm still experimenting with this. I had been crawling with a depth of 1 because I don't need anything outside my URLs list, but I tried with a depth of 10. It went through a crawl loop that ended after 3 hours, then a second 3 hour crawl loop, then a third shorter loop. It still stopped 5 URLs short of crawling every URL in my list, though it crawled a few I hadn't included.

Are these 3 hour loops standard for large crawls?

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Tuesday, April 17, 2018 3:27 PM
To: user@nutch.apache.org
Subject: RE: Nutch fetching times out at 3 hours, not sure why.

I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, or even at the same point in a URL's fetcher loop; it really seems to be time based.
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Lewis,

I'm using Nutch 1.2.

Chip

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Wednesday, April 18, 2018 1:55 PM
To: user@nutch.apache.org
Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip,

Which version of Nutch are you using?
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Markus,

I don't see an indication of the web server blocking me, though that sounds reasonable. Could there be a per-server limit in Nutch itself that we're overlooking, since this is all on the same server?

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, April 17, 2018 3:58 PM
To: user@nutch.apache.org
Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Hello Chip,

I have no clue where the three hour limit could come from. Please take a further look in the last few minutes of the logs.

The only thing I can think of is that a webserver would block you after some amount of requests/time window; that would be visible in the logs. It is clear Nutch itself terminates the fetcher (the dropping line). That is only possible with an imposed time limit, or if you reached some number of exceptions (or one other variable I am forgetting).

Regards,
Markus
Re: Nutch fetching times out at 3 hours, not sure why.
Hi Chip,

Which version of Nutch are you using?

On Tue, Apr 17, 2018 at 7:45 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Chip Calhoun <ccalh...@aip.org>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Tue, 17 Apr 2018 14:45:01 +
> Subject: Nutch fetching times out at 3 hours, not sure why.
>
> I crawl a list of roughly 2600 URLs all on my local server, and I'm only
> crawling around 1000 of them. The fetcher quits after exactly 3 hours (give
> or take a few milliseconds) with this message in the log:
>
> 2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue:
> https://history.aip.org >> dropping!
>
> I've seen that 3 hours is the default in some Nutch installations, but
> I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing
> something obvious. Any thoughts would be greatly appreciated. Thank you.
>
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD 20740-3840 USA
> Tel: +1 301-209-3180
> Email: ccalh...@aip.org
> https://www.aip.org/history-programs/niels-bohr-library

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
RE: Nutch fetching times out at 3 hours, not sure why.
Hello Chip,

I have no clue where the three hour limit could come from. Please take a further look in the last few minutes of the logs.

The only thing I can think of is that a webserver would block you after some amount of requests/time window; that would be visible in the logs. It is clear Nutch itself terminates the fetcher (the dropping line). That is only possible with an imposed time limit, or if you reached some number of exceptions (or one other variable I am forgetting).

Regards,
Markus

-----Original message-----
> From: Chip Calhoun <ccalh...@aip.org>
> Sent: Tuesday 17th April 2018 21:27
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
>
> I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL,
> or even at the same point in a URL's fetcher loop; it really seems to be time
> based.
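One way to follow the advice about checking the logs is to search for the "dropping" line. The sketch below is self-contained: it writes the single log line quoted in this thread to a temporary file and counts the matches. On a real install you would point LOG_FILE at your own fetcher log instead (in local mode this is often logs/hadoop.log under the Nutch directory, which is an assumption about your setup).

```shell
# Self-contained demo: count fetch queues that the fetcher dropped.
# The sample line is the one quoted in the thread; the temp file stands in
# for a real fetcher log.
LOG_FILE="$(mktemp)"
printf '%s\n' '2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!' > "$LOG_FILE"

# Count matching lines, then show them with line numbers for context.
DROPPED=$(grep -c 'FetchItemQueues .*dropping' "$LOG_FILE")
echo "queues dropped: $DROPPED"
grep -n 'FetchItemQueues .*dropping' "$LOG_FILE"

rm -f "$LOG_FILE"
```

Comparing the timestamp of the first match with the start of the fetch step shows whether the cut-off is time based, as it turned out to be in this thread.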
RE: Nutch fetching times out at 3 hours, not sure why.
I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, or even at the same point in a URL's fetcher loop; it really seems to be time based.

-----Original Message-----
From: Sadiki Latty [mailto:sla...@uottawa.ca]
Sent: Tuesday, April 17, 2018 1:43 PM
To: user@nutch.apache.org
Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Which version are you running? That value is defaulted to -1 in my current version (1.14) so shouldn't be something you should have needed to change. My crawls, by default, go for as much as even 12 hours with little to no tweaking necessary from the nutch-default. Something else is causing it. Is it always the same URL that it fails at?
RE: Nutch fetching times out at 3 hours, not sure why.
Which version are you running? That value is defaulted to -1 in my current version (1.14) so shouldn't be something you should have needed to change. My crawls, by default, go for as much as even 12 hours with little to no tweaking necessary from the nutch-default. Something else is causing it. Is it always the same URL that it fails at?

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: April-17-18 10:45 AM
To: user@nutch.apache.org
Subject: Nutch fetching times out at 3 hours, not sure why.

I crawl a list of roughly 2600 URLs all on my local server, and I'm only crawling around 1000 of them. The fetcher quits after exactly 3 hours (give or take a few milliseconds) with this message in the log:

2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

I've seen that 3 hours is the default in some Nutch installations, but I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. Any thoughts would be greatly appreciated. Thank you.

Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD 20740-3840 USA
Tel: +1 301-209-3180
Email: ccalh...@aip.org
https://www.aip.org/history-programs/niels-bohr-library