This is the issue indeed. With these settings you won't even fetch a million pages from a single domain in one week.
One request per second is not overly polite; it's usually considered invasive, especially if it continues for a week. You may want to contact the webmasters, or be patient. If you increase to 2 pages/s and keep downloading for 7 more days, you're likely to be rejected.

On Tuesday 30 August 2011 13:20:26 Eggebrecht, Thomas (GfK Marktforschung) wrote:
> Hi Markus,
> my conf:
> fetcher.server.delay: 1
> fetcher.threads.fetch: 10
> fetcher.threads.per.host: 1
>
> One URL per second from each host. Maybe it is too polite? These are all very big sites with potent resources, but I don't want my IP to be rejected. What would you suggest?
>
> Regards
> Thomas
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, August 30, 2011 1:07 PM
> To: [email protected]
> Subject: Re: Parameter tuning or how to accelerate fetching
>
> Hmm, I actually first expected some domains to have specified a crawl delay. Did you customize your fetcher configuration?
>
> - fetcher.server.delay
> - fetcher.threads.fetch
> - fetcher.threads.per.host
>
> Take care, this is going to change slightly in NUTCH-1073.
>
> On Tuesday 30 August 2011 12:55:56 Eggebrecht, Thomas (GfK Marktforschung) wrote:
> > Now I checked all robots.txt. They all allow me to enter the areas of my interest. Furthermore, if I run a search with NutchBean, I see all my desired domains quite equally represented:
> > total hits: 91870
> > 13 domain(s):
> > {www.carmondo.de=3903, www.autoextrem.de=1983, www.kbb.com=3706, www.motor-talk.de=7366, www.auto-motor-und-sport.de=12842, www.hunny.de=13107, www.pkw-forum.de=3605, forum.autobild.de=1640, www.carmagazine.co.uk=3740, www.bmw-syndikat.de=33305, community.evo.co.uk=3577, www.pistonheads.com=1450, www.edmunds.com=1646}
> >
> > I think robots.txt can't be the reason.
> >
> > Regards
> > Thomas
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Tuesday, August 30, 2011 12:24
> > To: [email protected]
> > Cc: Eggebrecht, Thomas (GfK Marktforschung)
> > Subject: Re: AW: Parameter tuning or how to accelerate fetching
> >
> > Your question was valid: why is my fetch so slow and how to accelerate?
> >
> > Again, first check your robots.txt. With so few domains it's almost certain that politeness is the problem here.
> >
> > > Hi List,
> > > Hi Hannes,
> > >
> > > All logs are without errors and warnings. Injecting, updating, merging and indexing are not a problem and take only minutes. One cycle takes 2 days with my parameters. Regex-urlfilter.txt is checked against the URL format of all sites.
> > >
> > > But I'm sorry to the list, I may not have asked clearly. I'm mainly interested in why there is such a big difference between fetched and unfetched URLs, and what I can do to force fetching.
> > >
> > > Please see my current readdb -stats output:
> > > TOTAL urls: 1698520
> > > [...]
> > > status 1 (db_unfetched): 1567047
> > > status 2 (db_fetched): 90399
> > > status 3 (db_gone): 11696
> > > status 4 (db_redir_temp): 4065
> > > status 5 (db_redir_perm): 10137
> > > status 6 (db_notmodified): 15176
> > >
> > > The process has now been running for exactly 30 days. In the meantime I have 90,399 fetched pages instead of the 30,000 after 15 days. Is this normal?
> > >
> > > Regards
> > > Thomas
> > >
> > > From: Hannes Carl Meyer [mailto:[email protected]]
> > > Sent: Tuesday, August 30, 2011 09:25
> > > To: [email protected]
> > > Cc: Eggebrecht, Thomas (GfK Marktforschung)
> > > Subject: Re: Parameter tuning or how to accelerate fetching
> > >
> > > Hi Thomas,
> > >
> > > first, 30,000 pages in two weeks is rather few...
> > >
> > > Where did you get the total number of pages from? From the CrawlDB? Please post a bin/nutch readdb crawldb/ -stats output here.
> > >
> > > How long does one cycle take?
> > > If your regex-urlfilter.txt still has the standard settings, check your websites for common query URLs like "index.php?param=value&param1...". The standard regex-urlfilter is sometimes very strict in this case.
> > >
> > > BR
> > >
> > > Hannes
> > >
> > > --
> > >
> > > https://www.xing.com/profile/HannesCarl_Meyer
> > > http://de.linkedin.com/in/hannescarlmeyer
> > >
> > > On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <[email protected]> wrote:
> > > Dear List,
> > >
> > > My process fetches only 10, but very big, domains with millions of pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I got only a handful of about 30,000 pages, and it seems to be stagnating.
> > >
> > > How would you accelerate fetching?
> > >
> > > My current parameters (using Nutch-1.2):
> > > topN: 40,000
> > > depth: 8
> > > adddays: 30
> > > fetcher.server.delay: 1
> > > db.max.outlinks.per.page: 500
> > >
> > > All parameters not mentioned have standard values, as does regex-urlfilter.txt.
> > >
> > > Best Regards
> > > Thomas
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
> GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
> This email and any attachments may contain confidential or privileged information. Please note that unauthorized copying, disclosure or distribution of the material in this email is not permitted.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
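For reference, the fetcher properties traded back and forth in this thread are set by overriding defaults in conf/nutch-site.xml. A minimal sketch with the values Thomas reports, using the property names as they exist in Nutch 1.2 (i.e. before the NUTCH-1073 change Markus mentions); the description texts are paraphrased, not copied from nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: politeness settings from this thread (Nutch 1.2 names) -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value> <!-- seconds to wait between requests to the same host -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value> <!-- total number of fetcher threads -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value> <!-- maximum concurrent requests per host -->
  </property>
</configuration>
```

Raising fetcher.threads.per.host above 1 is the knob Thomas is asking about; as Markus warns, it trades politeness for speed and risks getting the crawler's IP blocked.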

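Hannes's warning about query URLs refers to a rule in the stock conf/regex-urlfilter.txt; if the default file is unchanged, it skips any URL containing characters that usually indicate a query string:

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

With that rule active, forum-style links such as showthread.php?t=1234 are filtered out before they ever reach the CrawlDB, which caps how many pages of a forum site can be discovered at all. Relaxing it (for example to `-[*!@]`, keeping `?` and `=` URLs) is a common adjustment, though it also lets in session IDs and duplicate query-parameter permutations.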

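Markus's claim that these settings "won't even fetch a million pages in one week" from one domain follows from simple arithmetic. A rough sketch (the helper function is illustrative, not part of Nutch), assuming one request per second per host and ignoring download time, retries, and cycle overhead, so the real rate is lower still:

```python
# Theoretical fetch ceiling under the politeness settings in this thread:
# fetcher.server.delay = 1 second, fetcher.threads.per.host = 1,
# i.e. at most one request per second per host.

def max_pages(hosts: int, delay_s: float, days: float) -> int:
    """Upper bound: one page per delay_s seconds from each host."""
    seconds = days * 24 * 60 * 60
    return int(hosts * seconds / delay_s)

per_domain_week = max_pages(hosts=1, delay_s=1.0, days=7)
all_hosts_week = max_pages(hosts=13, delay_s=1.0, days=7)
print(per_domain_week)  # 604800 -- under a million, as Markus says
print(all_hosts_week)   # 7862400 across all 13 domains
```

Thomas's observed 90,399 pages in 30 days is far below even this ceiling; note also that topN: 40,000 with a 2-day cycle schedules at most about 20,000 URLs per day across all hosts combined, regardless of the politeness settings.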