Hi Thomas,

first off: 30,000 pages in two weeks is indeed rather few.

Where did you get the total number of pages from, the CrawlDb? Please post the output of bin/nutch readdb crawldb/ -stats here.
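If I remember the 1.x reader correctly, the stats dump looks roughly like the
following (the numbers here are invented, only the field names matter):

    $ bin/nutch readdb crawldb/ -stats
    CrawlDb statistics start: crawldb/
    Statistics for CrawlDb: crawldb/
    TOTAL urls:              36500
    status 1 (db_unfetched):  5000
    status 2 (db_fetched):   30000
    status 3 (db_gone):        500
    ...
    CrawlDb statistics: done

The interesting figure is db_unfetched: if it is close to zero, the generator
simply has nothing left to hand to the fetcher, which usually means new
outlinks are being filtered out (or capped by db.max.outlinks.per.page)
rather than the fetcher being slow.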
How long does one cycle take?

If your regex-urlfilter.txt still has the standard settings, check your websites for common query URLs like "index.php?param=value&param1=...". The standard regex-urlfilter is sometimes very strict in this case; see the snippet below the quoted mail.

BR

Hannes

--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer

On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung)
<[email protected]> wrote:
> Dear List,
>
> My process fetches only 10 domains, but they are very big ones with
> millions of pages on each site. I now wonder why, after 2 weeks and 17
> crawl-fetch cycles, I got only a handful of about 30,000 pages, and
> fetching seems to be stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch 1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned are at their default values, as is
> regex-urlfilter.txt.
>
> Best regards
> Thomas
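P.S. The rule I mean is this pair of lines in the default
conf/regex-urlfilter.txt (quoting the 1.x default from memory), which rejects
every URL containing a ?, = or similar as a probable query:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

If your ten sites serve their millions of pages through query URLs, this
single rule filters almost all of them out. One way to relax it, at the risk
of running into session-ID and calendar traps, is to keep blocking only the
other characters:

    -[*!@]

As far as I know, URLs rejected so far never made it into the CrawlDb, so
they will only show up after the pages linking to them have been fetched and
parsed again with the relaxed filter in place.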

