Hi,

I have been running crawls for the last 5 days. Until yesterday there were no
problems: fetching and updating were working fine. When I started the crawl
yesterday with topN set to 80000, the process got stuck in fetching. It has
been almost 16 hours and it still has not completed depth 1.
I am using a cluster of 4 systems, each with 4GB of memory; three of the
machines have 2TB of hard disk space and the other has 500GB. I checked the
space on all the nodes, and every system has at least 50GB free. I went
through the mailing list and found a few suggestions, such as reducing the
maximum pages per host, but before doing that I would like some more advice
on this. Can anyone please help me with their suggestions?
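
For what it's worth, the "maximum pages per host" suggestion I found seems to refer to a property like the following in conf/nutch-site.xml (I'm not sure this is the exact property name for my Nutch version, so please correct me if it's wrong):

```xml
<!-- Sketch of the setting I found suggested on the list; the property
     name and a sensible value are my assumptions, not something I have
     verified against my Nutch version. -->
<property>
  <name>generate.max.per.host</name>
  <!-- Cap the number of URLs selected per host in each segment, so one
       slow host cannot dominate the fetch list. -1 means no limit. -->
  <value>100</value>
</property>
```

Would limiting this help with a fetch that hangs like mine, or is it unrelated?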

The Hadoop log shows the following:
2010-05-24 19:33:14,755 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2010-05-24 19:33:14,756 INFO  crawl.Generator - Generator: starting
2010-05-24 19:33:14,756 INFO  crawl.Generator - Generator: segment:
crawled/segments/20100524193314
2010-05-24 19:33:14,756 INFO  crawl.Generator - Generator: filtering: true
2010-05-24 19:33:14,756 INFO  crawl.Generator - Generator: topN: 80000
2010-05-24 20:18:05,558 INFO  crawl.Generator - Generator: Partitioning
selected urls by host, for politeness.
2010-05-24 20:19:36,187 INFO  crawl.Generator - Generator: done.
2010-05-24 20:19:37,672 INFO  fetcher.Fetcher - Fetcher: starting
2010-05-24 20:19:37,672 INFO  fetcher.Fetcher - Fetcher: segment:
crawled/segments/20100524193314
2010-05-24 20:19:37,842 WARN  mapred.JobClient - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.

It is stuck here.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Fetching-is-slow-after-few-crawls-tp841366p841366.html
Sent from the Nutch - User mailing list archive at Nabble.com.