Hi List,
Hi Hannes,

None of the logs show errors or warnings. Injecting, updating, merging and 
indexing are not a problem and take only minutes. One cycle takes 2 days with 
my parameters. I have checked regex-urlfilter.txt against the URL formats of 
all sites.

My apologies to the list, I may not have asked clearly. I am mainly interested 
in why there is such a big difference between fetched and unfetched URLs, and 
what I can do to force fetching.

Please see my current readdb -stats output:
TOTAL urls: 1698520
[...]
status 1 (db_unfetched): 1567047
status 2 (db_fetched): 90399
status 3 (db_gone): 11696
status 4 (db_redir_temp): 4065
status 5 (db_redir_perm): 10137
status 6 (db_notmodified): 15176

The process has now been running for exactly 30 days. I now have 90,399 fetched 
pages, compared with about 30,000 after 15 days. Is this normal?
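
To put the numbers above in proportion, here is a minimal Python sketch that reads the status lines and works out the fetched share and the average daily rate. It assumes the lines look exactly as pasted in this mail (real readdb output may carry log prefixes), and the helper name parse_stats is made up for illustration.

```python
import re

# Figures copied from the readdb -stats output in this mail.
STATS = """\
TOTAL urls: 1698520
status 1 (db_unfetched): 1567047
status 2 (db_fetched): 90399
status 3 (db_gone): 11696
status 4 (db_redir_temp): 4065
status 5 (db_redir_perm): 10137
status 6 (db_notmodified): 15176
"""

def parse_stats(text):
    """Return (total, {status_name: count}) from readdb -stats style lines."""
    total = None
    counts = {}
    for line in text.splitlines():
        m = re.search(r"TOTAL urls:\s*(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.search(r"status \d+ \((\w+)\):\s*(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return total, counts

total, counts = parse_stats(STATS)
fetched_share = counts["db_fetched"] / total
unfetched_share = counts["db_unfetched"] / total
pages_per_day = counts["db_fetched"] / 30  # 30 days of running, per this mail

print(f"fetched:   {fetched_share:.1%}")
print(f"unfetched: {unfetched_share:.1%}")
print(f"average:   {pages_per_day:.0f} fetched pages per day")
```

The six status counts add up to the reported total, so roughly nine out of ten known URLs are still waiting to be fetched.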

Regards
Thomas

From: Hannes Carl Meyer [mailto:[email protected]]
Sent: Tuesday, 30 August 2011 09:25
To: [email protected]
Cc: Eggebrecht, Thomas (GfK Marktforschung)
Subject: Re: Parameter tuning or how to accelerate fetching

Hi Thomas,

first, 30,000 pages in two weeks is rather few...

Where did you get the total number of pages from? From the CrawlDb?
Please post a bin/nutch readdb crawldb/ -stats output here.

How long does one cycle take?

If your regex-urlfilter.txt still has the standard settings, check your 
websites for common query URLs containing something like 
"index.php?param=value&param1..". The standard regex-urlfilter is sometimes 
very strict in this case.
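
To illustrate the point about query URLs, here is a rough Python sketch of a first-match URL filter. It is a simplified stand-in, not Nutch's own code. It assumes the stock regex-urlfilter.txt contains a rule like -[?*!@=] followed by a final +. (please verify against your own copy), and the sample URLs are hypothetical.

```python
import re

# Simplified rule list: (accept?, pattern). The first entry mirrors a
# "-[?*!@=]" style rule that skips probable query URLs; the last entry
# mirrors "+." and accepts everything else. Both are assumptions about
# the file in use, so check your own regex-urlfilter.txt.
RULES = [
    (False, re.compile(r"[?*!@=]")),
    (True, re.compile(r".")),
]

def accepts(url, rules=RULES):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Hypothetical sample URLs, not taken from the crawled sites.
samples = [
    "http://www.example.com/products/list.html",
    "http://www.example.com/index.php?param=value&param1=x",
]
for url in samples:
    print("keep" if accepts(url) else "drop", url)
```

Under these assumed rules the plain page is kept and the index.php?... URL is dropped, so a site that links mostly through query strings would contribute few fetchable URLs.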

BR

Hannes

--

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung) 
<[email protected]> wrote:
Dear List,

My process fetches only 10 domains, but they are very big, with millions of 
pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, 
I have only about 30,000 pages, and the number seems to be stagnating.

How would you accelerate fetching?

My current parameters (using Nutch-1.2):
topN: 40,000
depth: 8
adddays: 30
fetcher.server.delay: 1
db.max.outlinks.per.page: 500

All parameters not mentioned have their standard values, and 
regex-urlfilter.txt is the standard one as well.
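
As a rough sanity check on these parameters, the following Python sketch works out two bounds. It assumes that a "cycle" is one generate/fetch round limited by topN, that URLs are spread evenly over the 10 hosts, that each host gets one request at a time separated by fetcher.server.delay, and that download time is ignored. None of these assumptions is confirmed by the setup described here.

```python
# Values from this mail.
TOP_N = 40_000            # URLs selected per generate/fetch cycle
CYCLES = 17               # cycles run in the first two weeks
HOSTS = 10                # number of crawled domains
SERVER_DELAY_S = 1.0      # fetcher.server.delay, in seconds
FETCHED_SO_FAR = 30_000   # pages reported after two weeks

# Ceiling if every cycle had fetched a full topN segment.
max_by_topn = TOP_N * CYCLES

# Assumption: one request at a time per host, spaced by the server delay,
# with download time ignored. This is the shortest a full cycle could take.
urls_per_host = TOP_N / HOSTS
min_cycle_hours = urls_per_host * SERVER_DELAY_S / 3600

print(f"topN ceiling over {CYCLES} cycles: {max_by_topn}")
print(f"share of ceiling reached: {FETCHED_SO_FAR / max_by_topn:.1%}")
print(f"politeness floor per cycle: {min_cycle_hours:.1f} h")
```

If these assumptions hold, the fetched count is far below what topN would allow, which suggests that the segments are not being filled rather than that the fetcher is too slow.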

Best Regards
Thomas


________________________________

GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management 
Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. 
Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm R. Wessels; 
Chairman of the Supervisory Board: Dr. Arno Mahlert
This email and any attachments may contain confidential or privileged 
information. Please note that unauthorized copying, disclosure or distribution 
of the material in this email is not permitted.


