Hello - this is not the easiest thing to do because of transient errors (which 
happen frequently enough). The best approach is to start the crawl and keep 
recrawling until the generate job returns something like "No records selected", 
which means the generator thinks there is nothing left to crawl.
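
For reference, that loop looks roughly like this on a 1.x install (the 
crawl/crawldb and crawl/segments paths and the -topN value are placeholders 
for whatever your setup uses; on the versions I have run, generate exits 
non-zero when it selects nothing):

  #!/bin/bash
  # Recrawl loop: stop once the generator selects no more records.
  while true; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    if [ $? -ne 0 ]; then
      # Generator selected nothing ("No records selected"), so we are done.
      break
    fi
    SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"
  done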

Then run readdb -stats to check the status counts. If nothing is 
db_unfetched, everything has been crawled. If some URLs are still unfetched, 
there will most likely also be retry stats, meaning those fetches failed and 
the URLs are eligible for refetch the next day (configurable).
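
The check itself is just (same placeholder path as above):

  bin/nutch readdb crawl/crawldb -stats

and it prints a count per status, along these lines (exact labels vary by 
version, and these numbers are made up):

  CrawlDb statistics start: crawl/crawldb
  Statistics for CrawlDb: crawl/crawldb
  TOTAL urls:     2500
  retry 0:        2488
  retry 1:        12
  status 1 (db_unfetched):        12
  status 2 (db_fetched):          2300
  status 3 (db_gone):             150
  status 5 (db_redir_perm):       38
  CrawlDb statistics: done

The retry lines show how many fetch attempts those URLs have already had; 
db.fetch.retry.max caps how many times Nutch will keep retrying them.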

M.

 
-----Original message-----
> From: Manish Verma <[email protected]>
> Sent: Tuesday 15th December 2015 20:04
> To: [email protected]
> Subject: How To Validate Nutch Crawl
> 
> Hi,
> 
> I want to validate a Nutch crawl just to make sure all links (URLs) have 
> been crawled. For example, if one page has 500 URLs, I want to make sure it 
> crawled all 500.
> One way is to manually identify all links on a page and then check that each 
> URL is present in the crawled URLs.
> 
> Another thing: is there any way to check which URLs could not be crawled, 
> e.g. due to some filter, or because the website did not allow some page to 
> be crawled, or some other reason?
> 
> Thanks
> 
