I might have figured out why. Our website uses a lot of query strings in its URLs. One example is http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html. Could this be why? If so, how do I get Nutch to crawl such URLs?
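
(A likely culprit, assuming the stock conf/regex-urlfilter.txt is in use: by default it contains a rule that drops any URL containing a '?', so query-string addresses are filtered out before they are ever fetched. A minimal sketch of the usual change, with the relaxed rule below being illustrative rather than a recommendation:

  # skip URLs containing certain characters as probable queries, etc.
  # default rule; disable or narrow it to let query strings through:
  # -[?*!@=]
  -[*!@]

After editing the filter, re-run the crawl so the previously skipped URLs can be fetched.)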

Regards,

On 5/24/12 11:28 AM, Piet van Remortel wrote:
I googled for you:

"Typically one starts testing one’s configuration by crawling at shallow
depths, sharply limiting the number of pages fetched at each level (-topN),
and watching the output to check that desired pages are fetched and
undesirable pages are not. Once one is confident of the configuration, then
an appropriate depth for a full crawl is around 10. The number of pages per
level (-topN) for a full crawl can be from tens of thousands to millions,
depending on your resources."
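
(For concreteness, a shallow test run along those lines with the 1.x crawl command would look roughly like this; the numbers are illustrative only:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

and a full crawl would raise -depth to around 10 and -topN to tens of thousands or more, as described above.)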

Also, as the Nutch documentation shows, the topN parameter is optional.

Can I respectfully suggest that you go through the basic information that
is available online to get familiar with Nutch? Copying the online
information into this mailing list is not helping anybody.


On Thu, May 24, 2012 at 10:19 AM, Tolga <[email protected]> wrote:


On 5/24/12 11:00 AM, Piet van Remortel wrote:

On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:

  - I don't fully understand the use of the topN parameter. Should I
increase it?

  yes
What would a sensible topN value be for a large university website?


  - You mean the parse-pdf thing? I've got that in my nutch-default.xml.
  good, should work then

  - I looked for the link; it was there. Besides, that was for another
website I was experimenting on.
- How do I check segments?

  e.g. with segmentreader, a Hadoop access command built into Nutch
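
(A sketch of that, assuming the usual crawl/segments layout; the segment name below is a placeholder, since segment directories are named by timestamp:

  # list the segments that were written
  bin/nutch readseg -list -dir crawl/segments

  # dump one segment to plain text and look for the missing PDF URLs
  bin/nutch readseg -dump crawl/segments/<segment-timestamp> segdump
  grep -r "\.pdf" segdump

)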

  - I didn't check filenames, but I've tried searching for a word in that
PDF file.

  then the reason could also be indexing

  - I've got more than 50 GB free.
- I'm not sure about the webserver kicking me off; I'll have to check that
with the sysadmin.

  that should be visible as timeouts or similar messages in the
Hadoop logs
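
(One quick way to look, assuming the default logs/hadoop.log location:

  grep -iE "timed out|timeout|connection refused" logs/hadoop.log

)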


  Regards,

On 5/24/12 10:25 AM, Piet van Remortel wrote:

  - your topN parameter limited the crawl: see the info at
http://wiki.apache.org/nutch/NutchTutorial


or:

- file filters
- there is no link to the files (as you suggested yourself already)
- did you check the correct/all segments?
- did you check the fully correct filenames? wildcards don't work with all
segmentreader approaches
- size limits of the crawler (see previous discussion)
- did you check file presence in the segment, or the parse result? i.e.
parsing could have failed (cf. the previous discussion of the last few
days); see the sketch after this list
- your disk got full and crawling stopped
- the webserver(s) kicked you off
- your hadoop logs have overrun the local disk on which the crawler was
running (i.e. disk full)
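
(Sketch for the "presence in segment vs. parse result" check; the segment name and URL are placeholders:

  bin/nutch readseg -get crawl/segments/<segment-timestamp> \
    http://www.example.edu/path/to/file.pdf

If the record comes back with fetched content but no parse text, fetching worked and parsing is the likely problem; if there is no record at all, the URL was never fetched, which points back at filters, topN, or missing links.)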

Piet


On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:

  Hi,

I am crawling a large website, our university's. From the logs and some
grep'ing, I see that some PDF files were not crawled. Why could this
happen? I'm crawling with -depth 100 -topN 5.

Regards,


