- your topN parameter limited the crawl: see the info at http://wiki.apache.org/nutch/NutchTutorial (a rough sketch of this is below the list)
or:
- file filters
- there is no link to the files (as you suggested yourself already)
- did you check the correct/all segments?
- did you check the fully correct filenames? wildcards don't work on all segmentreader approaches
- size limits of the crawler (see the previous discussion)
- did you check file presence in the segment, or the parse result? i.e. parsing could have failed (cf. the discussion of the last few days)
- your disk got full and crawling stopped
- the webserver(s) kicked you off
- your hadoop logs have overrun the local disk on which the crawler was running (i.e. disk full)

Some rough command sketches for a few of these checks follow below.
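For the topN point: with the -depth 100 -topN 5 invocation you mentioned, only the 5 top-scoring URLs are fetched per round, so deep PDF links may simply never get selected. A re-run with a larger per-round limit would look roughly like this (the urls/crawl directory names are placeholders, adjust to your setup):

    # original style of invocation from your mail:
    #   bin/nutch crawl <urldir> -depth 100 -topN 5
    # try a much larger per-round limit, for example:
    bin/nutch crawl urls -dir crawl -depth 100 -topN 1000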
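For the file-filter point: check whether one of the URL filters rejects .pdf. A quick check, assuming the standard Nutch 1.x conf/ layout (not all of these files exist in every release):

    # look for a suffix rule that drops pdf, e.g. something like  -\.(gif|...|pdf)$
    grep -n -i "pdf" conf/regex-urlfilter.txt conf/suffix-urlfilter.txt conf/crawl-urlfilter.txt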
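For the segment checks: rather than relying on wildcards, dump each segment to plain text and grep for the missing URLs. A sketch (the segment name is made up, and the dump file name can differ per version):

    # repeat for every directory under crawl/segments/, not only the newest one
    bin/nutch readseg -dump crawl/segments/20120524100000 segdump -nocontent
    grep -i "\.pdf" segdump/dump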
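For the size-limit and parse-failure points: verify that a PDF-capable parser is enabled and that the fetcher isn't truncating the files, then look in the crawler's log. Property and log names below are from a typical Nutch 1.x setup, so check them against your own nutch-default.xml:

    # parse-tika (or parse-pdf) should appear in plugin.includes, and
    # http.content.limit (default 65536 bytes) should exceed your PDF sizes
    grep -A 2 "plugin.includes\|http.content.limit" conf/nutch-site.xml conf/nutch-default.xml
    # truncation/parse errors usually show up here; this also tells you if the disk filled up
    grep -i -E "pdf|truncat|parse.*fail|No space left" logs/hadoop.log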
Piet

On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
> Hi,
>
> I am crawling a large website, which is our university's. From the logs
> and some grep'ing, I see that some pdf files were not crawled. Why could
> this happen? I'm crawling with -depth 100 -topN 5.
>
> Regards,
>