Hi Thomas,

First off, 30,000 pages in two weeks is indeed rather few...

Where did you get the total number of pages from? From the CrawlDb?
Please post the output of bin/nutch readdb crawldb/ -stats here.
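
For reference, the output should look roughly like this (the numbers here
are invented, just to show which fields to look at):

  $ bin/nutch readdb crawldb/ -stats
  CrawlDb statistics start: crawldb/
  Statistics for CrawlDb: crawldb/
  TOTAL urls:     2500000
  status 1 (db_unfetched):  2400000
  status 2 (db_fetched):    30000
  status 3 (db_gone):       2000
  CrawlDb statistics: done

If db_unfetched is huge while db_fetched stagnates, the generator and the
URL filters are the first place to look.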

How long does one cycle take?

If your regex-urlfilter.txt still has the default rules, check whether your
websites use query URLs like "index.php?param=value&param1..". The default
regex-urlfilter is very strict about such URLs and will drop most of them;
see the snippet below.
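
If I remember the stock config correctly, the rule responsible is this one
in conf/regex-urlfilter.txt:

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

If your sites rely on query parameters, comment that rule out or narrow it
(e.g. keep skipping session ids but allow "?" and "="), and the generator
should pick up those URLs in the following cycles.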

BR

Hannes

-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer

On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <
[email protected]> wrote:

> Dear List,
>
> My process fetches only 10 domains, but they are very big, with millions of
> pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch
> cycles, I have only about 30,000 pages, and the number seems to be stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch-1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned have their default values, as does
> regex-urlfilter.txt.
>
> Best Regards
> Thomas
>
