On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:
> - I don't fully understand the use of the topN parameter. Should I
> increase it?
>
yes
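
topN caps how many of the top-scoring URLs are fetched in each round, so
with -topN 5 each of your 100 rounds fetches at most 5 pages, which can
easily leave the PDFs out. As a rough sketch (urls/crawl are just the
tutorial's directory names, and 1000 is only an example value, adjust both
to your setup), something like

  bin/nutch crawl urls -dir crawl -depth 100 -topN 1000

lets far more of the site into each fetch list.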

> - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>
good, should work then

> - I looked for the link, it was there. Besides, that was for another
> website I was experimenting on.
> - How do I check segments?
>
e.g. with segmentreader, a Hadoop access command built into Nutch
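
For example, to check whether the PDF URLs actually ended up in a segment
(the segment name below is only a placeholder, use one of the timestamped
directories under crawl/segments), something along the lines of

  bin/nutch readseg -list -dir crawl/segments
  bin/nutch readseg -dump crawl/segments/20120524103000 segdump -nocontent -noparsetext
  grep -i '\.pdf' segdump/dump

The -list call prints per-segment generated/fetched/parsed counts, and the
dump writes a plain-text file (segdump/dump here) that you can grep for the
PDF URLs.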

> - I didn't check filenames, but I've tried searching for a word in that
> PDF file.
>
then the reason could also be indexing, not the crawl itself: searching for
a word only tests what made it into the index

> - I've got more than 50gb free.
> - I'm not sure about webserver kicking me off, I'll have to check that
> with the sysadmin.
>
should be visible as something like timeouts or a similar message in the
hadoop logs
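
Assuming the default log setup (logs/hadoop.log under the Nutch install),
a quick way to look is something like

  grep -iE 'timeout|refused|exception' logs/hadoop.log | less

checked around the time window of the crawl.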

>
> Regards,
>
>
> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>
>> - your topN parameter limited the crawl: see the info at
>> http://wiki.apache.org/nutch/NutchTutorial
>>
>> or:
>>
>> - file filters
>> - there is no link to the files (as you suggested yourself already)
>> - did you check the correct/all segments?
>> - did you check the fully correct filenames? wildcards don't work on all
>> segmentreader approaches
>> - size limits of the crawler (see previous discussion)
>> - did you check file presence in the segment, or parse result? i.e.
>> parsing could have failed (cfr the previous discussion of the last few
>> days)
>> - your disk got full and crawling stopped
>> - the webserver(s) kicked you off
>> - your hadoop logs have overrun the local disk on which the crawler was
>> running (i.e. disk full)
>>
>> Piet
>>
>>
>> On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am crawling a large website, which is our university's. From the logs
>>> and some grep'ing, I see that some PDF files were not crawled. Why could
>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>
>>> Regards,
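
P.S. On the "size limits of the crawler" point in the list above: if the
missing PDFs are large, it's worth checking the http.content.limit property,
since the fetcher truncates anything bigger than that and a truncated PDF
will usually fail to parse. Something like

  grep -A1 'http.content.limit' conf/nutch-default.xml conf/nutch-site.xml

shows what your install is using; raise it in conf/nutch-site.xml if needed.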