Hi Thomas,

This seems like a perfect situation for running the Nutch jobs in a clustered Hadoop setup, if you have the resources.

Given the length of your crawl (2 weeks) and the number of recursive fetch cycles (17), it is inherently hard for anyone, yourself included, to provide accurate answers to this query. I would begin with the logs: a generic search for FATAL, ERROR or WARN entries (as per the Commons Logging levels) will return every instance that might point you towards an answer.
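For the cluster run, assuming you have built the job artifact with ant (the jar name and paths below are illustrative, so adjust them to your build), something along these lines should submit the whole crawl to Hadoop with the parameters you quoted:

    bin/hadoop jar nutch-1.2.job org.apache.nutch.crawl.Crawl urls \
        -dir crawl -depth 8 -topN 40000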
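As for the logs, a quick grep like the one below against the default local log file should surface everything worth looking at (in a cluster run the equivalent entries end up in the per-task logs, so point it there instead):

    grep -E 'FATAL|ERROR|WARN' logs/hadoop.log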
On Mon, Aug 29, 2011 at 4:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <[email protected]> wrote:

> Dear List,
>
> My process fetches only 10, but very big, domains with millions of pages on
> each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I got
> only a handful of about 30,000 pages, and it seems to be stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch-1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned have standard values, as does
> regex-urlfilter.txt.
>
> Best Regards
> Thomas

--
Lewis

