Hi Thomas,

This seems like a perfect situation for running Nutch jobs in a Hadoop
cluster setup, if you have the resources. Given the length of your crawl
(2 weeks) and the number of recursive cycles, it is inherently hard for
anyone, let alone yourself, to provide accurate answers to this query. I
would begin with the logs: a generic search for FATAL, WARN or ERROR (the
Commons Logging levels) will return every such entry, which may lead to
some kind of answers.
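
For example, a quick pass over the log files could look like the sketch
below. This is only a minimal sketch: the logs/ directory and the *.log
file pattern are assumptions, so adjust them to match your installation.

import glob
import re

# Match the Commons Logging levels worth investigating first.
pattern = re.compile(r"\b(FATAL|ERROR|WARN)\b")

# Assumed location: Nutch writes its logs under logs/ in the install dir.
for path in sorted(glob.glob("logs/*.log*")):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if pattern.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")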

On Mon, Aug 29, 2011 at 4:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <
[email protected]> wrote:

> Dear List,
>
> My process fetches only 10, but very big, domains with millions of pages on
> each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I only
> have about 30,000 pages, and fetching seems to be stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch-1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned have their default values, as does
> regex-urlfilter.txt.
>
> Best Regards
> Thomas
>
>
>



-- 
*Lewis*
