On 5/24/12 11:00 AM, Piet van Remortel wrote:
On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:
- I don't fully understand the use of the topN parameter. Should I increase it?
yes
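topN is really a generator setting: each generate/fetch round selects only the
topN highest-scoring URLs from the crawldb, so -topN 5 fetches at most 5 pages
per round and most of what gets discovered (PDFs included) is never selected.
As a sketch, this is the step where it applies (the paths and the value 50000
are just placeholders):

  # select at most 50000 top-scoring URLs for the next fetch round
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000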
What would a sensible topN value be for a large university website?
- You mean the parse-pdf thing? I've got that in my nutch-default.xml.
good, should work then
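If you want to double-check which parser plugins are actually active, keep in
mind that nutch-site.xml overrides nutch-default.xml, so the effective value
may live there; depending on the version, PDFs are handled by parse-tika or
the older parse-pdf plugin:

  # show the effective parser plugin list; look for parse-tika (or parse-pdf)
  grep -A1 'plugin.includes' conf/nutch-site.xml conf/nutch-default.xml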
- I looked for the link; it was there. Besides, that was for another
website I was experimenting on.
- How do I check segments?
e.g. with the segment reader (bin/nutch readseg), a Hadoop access command built into Nutch
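For example (the segment path and PDF name below are placeholders):

  # list what one segment contains
  bin/nutch readseg -list crawl/segments/20120524120000
  # dump the segment (skipping raw content) and look for the PDF's URL in the dump
  bin/nutch readseg -dump crawl/segments/20120524120000 seg_dump -nocontent
  grep -r 'somefile.pdf' seg_dump/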
- I didn't check the filenames, but I did try searching for a word in that
PDF file.
then the cause could also be on the indexing side rather than the crawl
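If you index into Solr, you can tell the two apart by querying the index
directly; the host, port and field names below assume a stock setup, so
adjust them to yours:

  # is the PDF in the index at all?
  curl 'http://localhost:8983/solr/select?q=content:someDistinctiveWordFromThePdf&fl=url'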
- I've got more than 50 GB free.
- I'm not sure about the webserver kicking me off; I'll have to check that
with the sysadmin.
that should show up as timeouts or similar messages in the
Hadoop logs
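Something like this is usually enough to spot it (the log path assumes the
default local runtime layout):

  # scan the crawler log for fetch problems
  grep -iE 'timed out|timeout|refused|denied|exception' logs/hadoop.log | less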
Regards,
On 5/24/12 10:25 AM, Piet van Remortel wrote:
- your topN parameter limited the crawl: see the info at
http://wiki.apache.org/nutch/NutchTutorial
or:
- file filters (see the quick checks after this list)
- there is no link to the files (as you suggested yourself already)
- did you check the correct/all segments?
- did you check the exact, complete filenames? wildcards don't work with all
segment reader approaches
- size limits of the crawler (see the previous discussion, and the checks after this list)
- did you check for the file's presence in the segment, or for its parse
result? i.e. parsing could have failed (cf. the previous discussion of the
last few days)
- your disk got full and crawling stopped
- the webserver(s) kicked you off
- your Hadoop logs have filled up the local disk on which the crawler was
running (i.e. disk full)
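A couple of these can be checked quickly from the conf directory (the file
names are the stock ones; adjust if yours differ):

  # file filters: make sure .pdf URLs are not being filtered out
  grep -in 'pdf' conf/*urlfilter*.txt
  # size limits: http.content.limit defaults to 64 kB, and truncated PDFs usually fail to parse
  grep -A1 'http.content.limit' conf/nutch-site.xml conf/nutch-default.xml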
Piet
On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
Hi,
I am crawling a large website, our university's. From the logs and some
grep'ing, I can see that some PDF files were not crawled. Why could this
happen? I'm crawling with -depth 100 -topN 5.
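In full, the command is something along these lines (the urls and crawl
directory names are just placeholders):

  # one-step crawl: 100 rounds, at most 5 URLs fetched per round
  bin/nutch crawl urls -dir crawl -depth 100 -topN 5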
Regards,