I have now checked all the robots.txt files. They all allow access to the 
areas I am interested in. Furthermore, if I run a search with NutchBean, I 
see all my desired domains represented fairly equally:
total hits: 91870
13 domain(s):
{www.carmondo.de=3903, www.autoextrem.de=1983, www.kbb.com=3706, 
www.motor-talk.de=7366, www.auto-motor-und-sport.de=12842, www.hunny.de=13107, 
www.pkw-forum.de=3605, forum.autobild.de=1640, www.carmagazine.co.uk=3740, 
www.bmw-syndikat.de=33305, community.evo.co.uk=3577, www.pistonheads.com=1450, 
www.edmunds.com=1646}

I think robots.txt can't be the reason.
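
(For reference, a quick way to run such a search from the command line in 
Nutch 1.x is the searcher class itself; a minimal sketch, assuming a local 
crawl directory:

  bin/nutch org.apache.nutch.searcher.NutchBean <query>

This prints the total hit count and the top hits; the per-domain breakdown 
above was aggregated from the hit details.)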

Regards
Thomas

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, 30 August 2011 12:24
To: [email protected]
Cc: Eggebrecht, Thomas (GfK Marktforschung)
Subject: Re: AW: Parameter tuning or how to accelerate fetching

Your question was valid: why is my fetch so slow, and how can I accelerate it?

Again, first check your robots.txt. With so few domains it's almost certain 
that politeness is the problem here.
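
If it is, the usual knobs live in conf/nutch-site.xml. A minimal sketch with 
illustrative values, not a recommendation; only raise per-host concurrency 
for sites you are allowed to hit harder:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
    <description>Total number of fetcher threads.</description>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
    <description>Maximum threads per host; with the default of 1, ten
    hosts cap you at ten requests in flight.</description>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value>
    <description>Minimum delay between requests to the same host, applied
    only when fetcher.threads.per.host is greater than 1.</description>
  </property>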

> Hi List,
> Hi Hannes,
>
> All logs are free of errors and warnings. Injecting, updating, merging
> and indexing are not a problem and take only minutes. One cycle takes
> 2 days with my parameters. regex-urlfilter.txt has been checked against
> the URL formats of all sites.
>
> But my apologies to the list, I may not have asked clearly. What I am
> mainly interested in is why there is such a big difference between
> fetched and unfetched URLs, and what I can do to force fetching.
>
> Please see my current readdb -stats output:
> TOTAL urls: 1698520
> [...]
> status 1 (db_unfetched): 1567047
> status 2 (db_fetched): 90399
> status 3 (db_gone): 11696
> status 4 (db_redir_temp): 4065
> status 5 (db_redir_perm): 10137
> status 6 (db_notmodified): 15176
>
> The process has now been running for exactly 30 days. In the meantime I
> have 90,399 fetched pages instead of the 30,000 after 15 days. Is this
> normal?
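>
> (Back of the envelope: 90,399 pages in 30 days is roughly 3,000 pages
> per day, i.e. about 2 pages per minute across all hosts. That is well
> below the ~10 pages per second that 10 hosts at a 1-second server delay
> would allow in theory.)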
>
> Regards
> Thomas
>
> From: Hannes Carl Meyer [mailto:[email protected]]
> Sent: Tuesday, 30 August 2011 09:25
> To: [email protected]
> Cc: Eggebrecht, Thomas (GfK Marktforschung)
> Subject: Re: Parameter tuning or how to accelerate fetching
>
> Hi Thomas,
>
> first of all, 30,000 pages in two weeks is rather few...
>
> Where did you get the total number of pages from? From the CrawlDB?
> Please post the output of bin/nutch readdb crawldb/ -stats here.
>
> How long does one cycle take?
>
> If your regex-urlfilter.txt still has the standard settings, check your
> websites for common query URLs such as
> "index.php?param=value&param1..". The standard regex-urlfilter is
> sometimes very strict in this case.
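>
> For example, the stock regex-urlfilter.txt ships with a rule along
> these lines, which silently drops every URL containing a query string:
>
>   # skip URLs containing certain characters as probable queries, etc.
>   -[?*!@=]
>
> If your forums link to their threads via such query URLs, this single
> rule filters them all out.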
>
> BR
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung)
> <[email protected]> wrote:
> Dear List,
>
> My process fetches only 10, but very big, domains with millions of
> pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch
> cycles, I have only about 30,000 pages, and the count seems to be
> stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch-1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned have their standard values, as does
> regex-urlfilter.txt (see the command sketch below).
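>
> Roughly, the parameters above map onto the per-cycle commands as
> follows (a simplified sketch, not my literal script):
>
>   bin/nutch generate crawl/crawldb crawl/segments -topN 40000 -adddays 30
>   bin/nutch fetch crawl/segments/<segment>
>   bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
>
> with fetcher.server.delay and db.max.outlinks.per.page set in
> conf/nutch-site.xml.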
>
> Best Regards
> Thomas

