Hi List, Hi Hannes,

All logs are free of errors and warnings. Injecting, updating, merging and indexing are not a problem and take only minutes. One cycle takes 2 days with my parameters. I have checked regex-urlfilter.txt against the URL format of all sites.
My apologies to the list, though: I may not have asked clearly. I'm mainly interested in why there is such a big difference between fetched and unfetched URLs, and what I can do to force fetching. Please see my current readdb -stats output:

TOTAL urls: 1698520
[...]
status 1 (db_unfetched): 1567047
status 2 (db_fetched): 90399
status 3 (db_gone): 11696
status 4 (db_redir_temp): 4065
status 5 (db_redir_perm): 10137
status 6 (db_notmodified): 15176

The process has now been running for exactly 30 days. In the meantime I have 90,399 fetched pages, up from 30,000 after 15 days. Is this normal?

Regards
Thomas

From: Hannes Carl Meyer [mailto:[email protected]]
Sent: Tuesday, 30 August 2011 09:25
To: [email protected]
Cc: Eggebrecht, Thomas (GfK Marktforschung)
Subject: Re: Parameter tuning or how to accelerate fetching

Hi Thomas,

first, 30,000 pages in two weeks is rather few... Where did you get the total number of pages from? From the crawl DB? Please post the output of bin/nutch readdb crawldb/ -stats here.

How long does one cycle take?

If your regex-urlfilter.txt still has the standard settings, check your websites for common query URLs such as "index.php?param=value&param1..". The standard regex-urlfilter is sometimes very strict in this case.

BR
Hannes

--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer

On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <[email protected]<mailto:[email protected]>> wrote:

Dear List,

My process fetches only 10 domains, but they are very big, with millions of pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I have only about 30,000 pages, and the count seems to be stagnating.

How would you accelerate fetching?

My current parameters (using Nutch-1.2):

topN: 40,000
depth: 8
adddays: 30
fetcher.server.delay: 1
db.max.outlinks.per.page: 500

All parameters not mentioned have standard values, as does regex-urlfilter.txt.
Best Regards
Thomas

________________________________
GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
This email and any attachments may contain confidential or privileged information. Please note that unauthorized copying, disclosure or distribution of the material in this email is not permitted.
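
For anyone following the numbers in this thread, here is a minimal Python sketch that turns the readdb -stats figures quoted above into per-status shares and a rough backlog estimate. It is only an illustration: the helper names parse_stats and days_to_drain are made up for this example and are not part of Nutch, and the estimate assumes the fetch rate stays constant at the 30-day average, which a real crawl will not do because new outlinks keep being added.

```python
import re

# Figures copied from the readdb -stats output quoted in this thread.
STATS_TEXT = """\
TOTAL urls: 1698520
status 1 (db_unfetched): 1567047
status 2 (db_fetched): 90399
status 3 (db_gone): 11696
status 4 (db_redir_temp): 4065
status 5 (db_redir_perm): 10137
status 6 (db_notmodified): 15176
"""


def parse_stats(text):
    """Hypothetical helper: return (total, {status_name: count})."""
    total = None
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*TOTAL urls:\s*(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"\s*status \d+ \((\w+)\):\s*(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return total, counts


def days_to_drain(unfetched, fetched, days_elapsed):
    """Hypothetical helper: days needed to fetch the current backlog,
    ASSUMING the average rate so far (fetched / days_elapsed) stays
    constant and no new URLs are discovered."""
    rate_per_day = fetched / days_elapsed
    return unfetched / rate_per_day


total, counts = parse_stats(STATS_TEXT)

# Share of each status in the crawl DB.
for name, n in counts.items():
    print(f"{name:16s} {n:>9d} {100.0 * n / total:6.2f}%")

# Naive estimate based on the 30 days of runtime mentioned in the thread.
estimate = days_to_drain(counts["db_unfetched"], counts["db_fetched"], 30)
print(f"naive days to fetch current backlog: {estimate:.0f}")
```

The point of the sketch is only to make the ratio visible: the vast majority of known URLs are still db_unfetched, so the number of URLs selected and fetched per cycle, rather than injecting or indexing, looks like the limiting factor.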

