Hey

I have a batch of 5000 seed URLs. I am trying to crawl them using the Apache Nutch job created by running "ant clean runtime". The first 2 cycles of the Nutch workflow (inject -> generate -> fetch -> parse -> updatedb) work fine and fetch around 20,000 URLs. But when the workflow is run again after the 2nd cycle, the number of documents with status 2 in the database starts to decrease.

For example: after the 2nd cycle, the number of documents with status 2 was 22220, and the total number of links present after updatedb was 75882.

After the 3rd cycle, the number of documents with status 2 decreased to 22209, while the total number of links increased to 78443. I checked the logs, and the job is not reporting any errors, so I am unable to debug this. Are there any changes that need to be made in the Nutch configuration?
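
To make the change between the two cycles concrete, here is a quick sketch of the arithmetic on the figures above. The dictionary keys and the `delta` helper are only illustrative names of mine, not anything from Nutch.

```python
# Counts reported above, keyed by crawl cycle (names are illustrative only).
counts = {
    2: {"status_2": 22220, "total_links": 75882},
    3: {"status_2": 22209, "total_links": 78443},
}


def delta(field, before=2, after=3):
    """Return the change in a count between two crawl cycles."""
    return counts[after][field] - counts[before][field]


print("status 2 change:", delta("status_2"))        # 22209 - 22220
print("total links change:", delta("total_links"))  # 78443 - 75882
```

So the status 2 count dropped by 11 documents while the total number of links grew by 2561 in the same cycle.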

Please let me know if any more details are needed for a better understanding of the problem. This feels like black-box testing, and I am unable to come to a conclusion.

Please reply soon. Thanks in advance

Shubham Gupta
