Hi,

If you are able to extract content via parsechecker, you should
also be able to crawl it.
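
For reference, parsechecker can be invoked like this (the options may
vary slightly between Nutch versions; the URL is a placeholder):

   % bin/nutch parsechecker -dumpText http://your.url/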

For all _3_ URLs in the redirect chain:

1. check whether they pass URL filters and normalizers
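
   For a quick check you can feed the URLs to the checker classes
   shipped with Nutch 1.x (the URL below is a placeholder):

     % echo "http://your.url/" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
     % echo "http://your.url/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

   The filter checker prints "+" in front of accepted and "-" in
   front of rejected URLs.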

2. check whether "http.redirect.max" is set appropriately
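
   Note that "http.redirect.max" defaults to 0, which means the
   fetcher does not follow redirects immediately but records them
   for a later fetch cycle. To follow the whole chain within one
   fetch, set it in conf/nutch-site.xml, e.g.:

     <property>
       <name>http.redirect.max</name>
       <value>3</value>
     </property>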

3. run a crawl. Ideally, set the URL to be checked as the seed URL
   and choose small values for depth and topN; that makes the
   analysis simpler. If "http.redirect.max" is >= 3 you can even
   set depth and topN to 1.
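
   A minimal run could look like this (assuming the one-step crawl
   command of Nutch 1.x; the seed and crawl directories are
   placeholders):

     % mkdir seed
     % echo "http://your.url/" > seed/urls.txt
     % bin/nutch crawl seed -dir crawl -depth 1 -topN 1

   If "http.redirect.max" is left at 0, use -depth 3 instead so that
   each redirect target is fetched in a later round.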

4. check your logs for all _3_ URLs. You should see "fetching ..."
   three times (once per URL).
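
   In local mode the fetcher logs to logs/hadoop.log by default,
   so e.g.:

     % grep "fetching" logs/hadoop.log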

5. then check the CrawlDb for all URLs:
   % bin/nutch readdb .../crawldb -url URL
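
   e.g., with URL1, URL2, URL3 as placeholders for the three URLs
   of the chain:

     % for u in URL1 URL2 URL3; do bin/nutch readdb crawl/crawldb -url "$u"; done

   The redirecting URLs should show a status of db_redir_temp or
   db_redir_perm, the final target db_fetched.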

6. check the content of the segment(s) for all URLs:
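
   Segment names are timestamps; list them, then dump the record of
   a single URL, e.g.:

     % ls crawl/segments/
     % bin/nutch readseg -get crawl/segments/<segment> http://your.url/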

Sorry, there is no tool that performs all of these steps
automatically; you have to do them by hand.

Good luck,
Sebastian

On 07/15/2013 06:39 AM, devang pandey wrote:
> Hello Sebastian, thank you for your response. The thing is that my
> task is to crawl this URL; using the parsechecker command I am able
> to see the content of the page but not able to crawl it. Please help
> me with the crawling aspect as well.
