I have now checked all the robots.txt files. They all allow access to the
areas I am interested in. Furthermore, if I run a search with NutchBean, I
see all my desired domains fairly evenly represented:
total hits: 91870
13 domain(s):
{www.carmondo.de=3903, www.autoextrem.de=1983, www.kbb.com=3706,
www.motor-talk.de=7366, www.auto-motor-und-sport.de=12842, www.hunny.de=13107,
www.pkw-forum.de=3605, forum.autobild.de=1640, www.carmagazine.co.uk=3740,
www.bmw-syndikat.de=33305, community.evo.co.uk=3577, www.pistonheads.com=1450,
www.edmunds.com=1646}
I think robots.txt can't be the reason.
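(For reference, such a query can be reproduced from the command line, assuming
searcher.dir in nutch-site.xml points at the crawl directory:

  bin/nutch org.apache.nutch.searcher.NutchBean <query term>

NutchBean itself only prints the total hit count and the top hits; the
per-domain breakdown above is not part of its standard output and has to be
aggregated from the individual hits.)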
Regards
Thomas
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, 30 August 2011 12:24
To: [email protected]
Cc: Eggebrecht, Thomas (GfK Marktforschung)
Subject: Re: AW: Parameter tuning or how to accelerate fetching
Your question was valid: why is my fetching so slow, and how can I accelerate it?
Again, first check your robots.txt. With so few domains it's almost certain
that politeness is the problem here.
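As a rough sanity check: with 13 hosts and a 1 second fetcher.server.delay the
hard ceiling is about 13 pages per second; if you see far less, a Crawl-Delay
directive in a robots.txt (which Nutch honours, up to fetcher.max.crawl.delay)
or a generated segment dominated by a few hosts is usually the reason. The
knobs that matter are roughly the following (property names as in the stock
Nutch 1.2 nutch-default.xml, so please verify them against your install; the
values are only illustrative, not a recommendation):

  fetcher.threads.fetch: 10      total fetcher threads
  fetcher.threads.per.host: 1    parallel requests to the same host
  fetcher.server.delay: 1        seconds between requests to one host
  fetcher.max.crawl.delay: 30    skip pages whose robots.txt asks for a longer Crawl-Delay
  generate.max.per.host: -1      cap on URLs per host in a single segment (-1 = no cap)

Raising fetcher.threads.per.host or capping generate.max.per.host can change
throughput considerably, but it also changes how polite the crawler is to the
target sites.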
> Hi List,
> Hi Hannes,
>
> All logs are free of errors and warnings. Injecting, updating, merging
> and indexing are not a problem and take only minutes. One cycle takes
> 2 days with my parameters. I have checked regex-urlfilter.txt against the
> URL format of all sites.
>
> But my apologies to the list, I may not have asked clearly. I am mainly
> interested in why there is such a big difference between fetched and
> unfetched URLs, and what I can do to force fetching.
>
> Please see my current readdb -stats output:
> TOTAL urls: 1698520
> [...]
> status 1 (db_unfetched): 1567047
> status 2 (db_fetched): 90399
> status 3 (db_gone): 11696
> status 4 (db_redir_temp): 4065
> status 5 (db_redir_perm): 10137
> status 6 (db_notmodified): 15176
>
> The process has now been running for exactly 30 days. In the meantime the
> count has grown to 90,399 fetched pages, from 30,000 after 15 days. Is this normal?
>
> Regards
> Thomas
>
> From: Hannes Carl Meyer [mailto:[email protected]]
> Sent: Tuesday, 30 August 2011 09:25
> To: [email protected]
> Cc: Eggebrecht, Thomas (GfK Marktforschung)
> Subject: Re: Parameter tuning or how to accelerate fetching
>
> Hi Thomas,
>
> first, 30,000 pages in two weeks is rather few...
>
> Where did you get the total number of pages from? From the CrawlDb?
> Please post a bin/nutch readdb crawldb/ -stats output here.
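> (If your version supports it, bin/nutch readdb crawldb/ -stats -sort breaks
> the status counts down per host, which makes it easy to see whether a single
> domain dominates the unfetched URLs.)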
>
> How long does one cycle take?
>
> If your regex-urlfilter.txt still has the standard settings, check your
> websites for common query URLs such as
> "index.php?param=value&param1=...". The standard regex-urlfilter is
> sometimes very strict in this case.
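> For example, the stock regex-urlfilter.txt ships with a rule along these
> lines, which throws away every URL containing a query string:
>
>   # skip URLs containing certain characters as probable queries, etc.
>   -[?*!@=]
>
> Forum software typically produces URLs like showthread.php?t=12345 (only an
> illustrative pattern), so with this rule whole boards can silently be
> filtered out; if that is the case for your sites, relax the rule rather than
> removing it completely.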
>
> BR
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung)
> <[email protected]> wrote:
> Dear List,
>
> My process fetches only 10, but very big, domains with millions of pages
> on each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I
> have only a handful of about 30,000 pages, and the number seems to be stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch-1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned have their default values, and
> regex-urlfilter.txt is unchanged as well.
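> For reference, with these parameters one generate/fetch/update cycle
> corresponds roughly to the following steps (Nutch 1.2 command syntax; the
> crawl/ directory name is only a placeholder, and fetcher.server.delay and
> db.max.outlinks.per.page are set in nutch-site.xml):
>
>   bin/nutch generate crawl/crawldb crawl/segments -topN 40000 -adddays 30
>   SEGMENT=`ls -d crawl/segments/2* | tail -1`
>   bin/nutch fetch $SEGMENT
>   bin/nutch updatedb crawl/crawldb $SEGMENT
>
> and depth 8 means this loop runs 8 times per crawl.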
>
> Best Regards
> Thomas