Hi Markus,
my configuration:
fetcher.server.delay: 1
fetcher.threads.fetch: 10
fetcher.threads.per.host: 1

One URL per second from each host. Is that too polite? These are all very big 
sites with plenty of resources, but I don't want my IP to be blocked. What 
settings would you use?

Regards
Thomas
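
As a rough sanity check on the settings above, the following minimal sketch
estimates an upper bound on polite throughput and compares it with the fetch
rate reported further down in this thread (90,399 fetched URLs after 30 days,
13 domains). It assumes the delay is applied between successive requests to
the same host and that there is one fetch queue per host; the function name
and the simple model are illustrative assumptions, not Nutch code.

```python
SECONDS_PER_DAY = 86_400


def polite_ceiling_per_day(hosts, delay_s, fetch_time_s=0.0):
    """Upper bound on pages/day with one request in flight per host.

    Assumed model (not taken from Nutch source): each host serves one page
    every (delay_s + fetch_time_s) seconds, and hosts are fetched in parallel.
    """
    return hosts * SECONDS_PER_DAY / (delay_s + fetch_time_s)


# Figures from this thread: 13 domains, fetcher.server.delay = 1 second.
ceiling = polite_ceiling_per_day(hosts=13, delay_s=1.0)

# Reported in the thread: 90,399 fetched URLs after 30 days.
observed_per_day = 90_399 / 30

print(f"theoretical ceiling: {ceiling:,.0f} pages/day")
print(f"observed rate:       {observed_per_day:,.0f} pages/day")
```

Under these assumptions the ceiling is on the order of a million pages per
day, while the observed rate is about 3,000 per day. The model ignores
download time, any Crawl-delay in robots.txt, and the generate/update
overhead of each cycle, so it only indicates how far the delay setting alone
is from explaining the gap.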

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, August 30, 2011 1:07 PM
To: [email protected]
Subject: Re: Parameter tuning or how to accelerate fetching

Hmm, I actually expected some domains to have specified a crawl delay.
Did you customize your fetcher configuration?

- fetcher.server.delay
- fetcher.threads.fetch
- fetcher.threads.per.host


Take care, this is going to change slightly in NUTCH-1073
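
One way to check whether a site specifies a crawl delay is to parse its
robots.txt with Python's standard urllib.robotparser. This is only a sketch:
the sample robots.txt content and the user-agent name "my-crawler" are made
up for illustration, and the helper function is hypothetical. In practice the
text would be the robots.txt downloaded from each domain.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_from_text(robots_txt, user_agent="*"):
    """Return the Crawl-delay (seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

delay = crawl_delay_from_text(SAMPLE_ROBOTS_TXT, "my-crawler")
print("Crawl-delay for my-crawler:", delay)
```

If a domain does declare a Crawl-delay larger than the configured
fetcher.server.delay, fetching from that domain would be expected to be
correspondingly slower.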

On Tuesday 30 August 2011 12:55:56 Eggebrecht, Thomas (GfK Marktforschung)
wrote:
> I have now checked all the robots.txt files. They all allow me to crawl
> the areas I am interested in. Furthermore, if I run a search with
> NutchBean, I see all my desired domains fairly equally represented:
> total hits: 91870
> 13 domain(s):
> {www.carmondo.de=3903, www.autoextrem.de=1983, www.kbb.com=3706,
> www.motor-talk.de=7366, www.auto-motor-und-sport.de=12842,
> www.hunny.de=13107, www.pkw-forum.de=3605, forum.autobild.de=1640,
> www.carmagazine.co.uk=3740, www.bmw-syndikat.de=33305,
> community.evo.co.uk=3577, www.pistonheads.com=1450,
> www.edmunds.com=1646}
>
> I think robots.txt can't be the reason.
>
> Regards
> Thomas
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, August 30, 2011 12:24
> To: [email protected]
> Cc: Eggebrecht, Thomas (GfK Marktforschung)
> Subject: Re: AW: Parameter tuning or how to accelerate fetching
>
> Your question was valid: why is my fetch so slow and how can I accelerate it?
>
> Again, first check your robots.txt. With so few domains it's almost
> certain that politeness is the problem here.
>
> > Hi List,
> > Hi Hannes,
> >
> > All logs are free of errors and warnings. Injecting, updating,
> > merging and indexing are not a problem and take only minutes. One
> > cycle takes 2 days with my parameters. regex-urlfilter.txt has been
> > checked against the URL formats of all sites.
> >
> > My apologies to the list, I may not have asked clearly. I'm mainly
> > interested in why there is such a big difference between fetched
> > and unfetched URLs, and what I can do to force fetching.
> >
> > Please see my current readdb -stats output:
> > TOTAL urls: 1698520
> > [...]
> > status 1 (db_unfetched): 1567047
> > status 2 (db_fetched): 90399
> > status 3 (db_gone): 11696
> > status 4 (db_redir_temp): 4065
> > status 5 (db_redir_perm): 10137
> > status 6 (db_notmodified): 15176
> >
> > The process has now been running for exactly 30 days. In the meantime
> > I have 90,399 fetched URLs, up from 30,000 after 15 days. Is this normal?
> >
> > Regards
> > Thomas
> >
> > From: Hannes Carl Meyer [mailto:[email protected]]
> > Sent: Tuesday, August 30, 2011 09:25
> > To: [email protected]
> > Cc: Eggebrecht, Thomas (GfK Marktforschung)
> > Subject: Re: Parameter tuning or how to accelerate fetching
> >
> > Hi Thomas,
> >
> > first, 30,000 pages in two weeks is rather few...
> >
> > Where did you get the total number of pages from? From the CrawlDb?
> > Please post a bin/nutch readdb crawldb/ -stats output here.
> >
> > How long does one cycle take?
> >
> > If your regex-urlfilter.txt still has the standard settings, check
> > your websites for common query URLs such as
> > "index.php?param=value&param1..". The standard regex-urlfilter is
> > sometimes very strict in this case.
> >
> > BR
> >
> > Hannes
> >
> > --
> >
> > https://www.xing.com/profile/HannesCarl_Meyer
> > http://de.linkedin.com/in/hannescarlmeyer
> > On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK
> > Marktforschung)
> > <[email protected]<mailto:[email protected]>> wrote:
> > Dear List,
> >
> > My process fetches only 10 domains, but very big ones with millions of
> > pages on each site. I now wonder why, after 2 weeks and 17
> > crawl-fetch cycles, I have only about 30,000 pages, and it seems to be
> > stagnating.
> >
> > How would you accelerate fetching?
> >
> > My current parameters (using Nutch-1.2):
> > topN: 40,000
> > depth: 8
> > adddays: 30
> > fetcher.server.delay: 1
> > db.max.outlinks.per.page: 500
> >
> > All parameters not mentioned have standard values as well as
> > regex-urlfilter.txt.
> >
> > Best Regards
> > Thomas
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


