- your topN parameter limited the crawl: see the info at http://wiki.apache.org/nutch/NutchTutorial (a rough sketch of this is below the list)
or:
- file filters
- there is no link to the files (as you suggested yourself already)
- did you check the correct/all segments?
- did you check the fully correct filenames? wildcards don't work on all segmentreader approaches
- size limits of the crawler (see the previous discussion)
- did you check file presence in the segment, or the parse result? i.e. parsing could have failed (cf. the discussion of the last few days)
- your disk got full and crawling stopped
- the webserver(s) kicked you off
- your hadoop logs have overrun the local disk on which the crawler was running (i.e. disk full)

Some rough command sketches for a few of these checks follow below.
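For the topN point: with the -depth 100 -topN 5 invocation you mentioned, only the 5 top-scoring URLs are fetched per round, so deep PDF links may simply never get selected. A re-run with a larger per-round limit would look roughly like this (the urls/crawl directory names are placeholders, adjust to your setup):

    # original style of invocation from your mail:
    #   bin/nutch crawl <urldir> -depth 100 -topN 5
    # try a much larger per-round limit, for example:
    bin/nutch crawl urls -dir crawl -depth 100 -topN 1000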
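For the file-filter point: check whether one of the URL filters rejects .pdf. A quick check, assuming the standard Nutch 1.x conf/ layout (not all of these files exist in every release):

    # look for a suffix rule that drops pdf, e.g. something like  -\.(gif|...|pdf)$
    grep -n -i "pdf" conf/regex-urlfilter.txt conf/suffix-urlfilter.txt conf/crawl-urlfilter.txt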
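For the segment checks: rather than relying on wildcards, dump each segment to plain text and grep for the missing URLs. A sketch (the segment name is made up, and the dump file name can differ per version):

    # repeat for every directory under crawl/segments/, not only the newest one
    bin/nutch readseg -dump crawl/segments/20120524100000 segdump -nocontent
    grep -i "\.pdf" segdump/dump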
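For the size-limit and parse-failure points: verify that a PDF-capable parser is enabled and that the fetcher isn't truncating the files, then look in the crawler's log. Property and log names below are from a typical Nutch 1.x setup, so check them against your own nutch-default.xml:

    # parse-tika (or parse-pdf) should appear in plugin.includes, and
    # http.content.limit (default 65536 bytes) should exceed your PDF sizes
    grep -A 2 "plugin.includes\|http.content.limit" conf/nutch-site.xml conf/nutch-default.xml
    # truncation/parse errors usually show up here; this also tells you if the disk filled up
    grep -i -E "pdf|truncat|parse.*fail|No space left" logs/hadoop.log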
Piet

On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
> Hi,
>
> I am crawling a large website, which is our university's. From the logs
> and some grep'ing, I see that some pdf files were not crawled. Why could
> this happen? I'm crawling with -depth 100 -topN 5.
>
> Regards,
>