I googled for you:

"Typically one starts testing one’s configuration by crawling at shallow
depths, sharply limiting the number of pages fetched at each level (-topN),
and watching the output to check that desired pages are fetched and
undesirable pages are not. Once one is confident of the configuration, then
an appropriate depth for a full crawl is around 10. The number of pages per
level (-topN) for a full crawl can be from tens of thousands to millions,
depending on your resources."

Also, as the Nutch documentation shows, the topN parameter is optional.
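
For example, a shallow test crawl followed by a fuller one might look
roughly like this (the urls seed directory and the -dir/-depth/-topN
values are placeholders to adapt to your own setup):

  bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50
  bin/nutch crawl urls -dir crawl-full -depth 10 -topN 50000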

Can I respectfully suggest that you go through the basic information that
is available online to get familiar with Nutch? Copying the online
information into this mailing list is not helping anybody.


On Thu, May 24, 2012 at 10:19 AM, Tolga <[email protected]> wrote:

>
>
> On 5/24/12 11:00 AM, Piet van Remortel wrote:
>
>> On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:
>>
>>> - I don't fully understand the use of topN parameter. Should I increase
>>> it?
>>
>>  yes
>>
> What would a sensible topN value be for a large university website?
>
>>
>>
>>> - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>>
>>  good, should work then
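>>
>>  one thing worth double-checking: conf/nutch-site.xml overrides
>> nutch-default.xml, and the plugin.includes value has to list the PDF
>> parser (parse-tika in recent versions, parse-pdf in older ones), e.g.
>> something like:
>>
>>    grep -A 3 plugin.includes conf/nutch-site.xml conf/nutch-default.xml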
>>
>>
>>> - I looked for the link, it was there. Besides, that was for another
>>> website I was experimenting on.
>>> - How do I check segments?
>>
>>  e.g. with segmentreader, a hadoop access command built into Nutch
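>>
>>  e.g. something along these lines (the crawl dir and segment name are
>> placeholders for whatever you used):
>>
>>    bin/nutch readseg -list -dir crawl/segments
>>    bin/nutch readseg -dump crawl/segments/<segment> segdump -nocontent
>>
>>  then grep the dump for the PDF URLs you expect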
>>
>>
>>> - I didn't check filenames, but I've tried searching for a word in that
>>> PDF file.
>>
>>  then the reason could also be indexing
>>
>>
>>> - I've got more than 50 GB free.
>>> - I'm not sure about the webserver kicking me off, I'll have to check that
>>> with the sysadmin.
>>
>>  should be visible as something like timeouts or a similar message in the
>> hadoop logs
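>>
>>  e.g. something like (assuming the default local log location):
>>
>>    grep -iE "timed out|timeout|connection refused" logs/hadoop.log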
>>
>>
>>> Regards,
>>>
>>>
>>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>>
>>>> - your topN parameter limited the crawl: see the info at
>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>
>>>>
>>>> or :
>>>>
>>>> - file filters
>>>> - there is no link to the files (as you suggested yourself already)
>>>> - did you check the correct/all segments ?
>>>> - did you check the fully correct filenames ? wildcards don't work on all
>>>> segmentreader approaches
>>>> - size limits of the crawler (see previous discussion)
>>>> - did you check file presence in the segment, or parse result ?  i.e.
>>>> parsing could have failed (cfr the previous discussion of the last few
>>>> days)
>>>> - your disk got full and crawling stopped
>>>> - the webserver(s) kicked you off
>>>> - your hadoop logs have overrun the local disk on which the crawler was
>>>> running (i.e. disk full)
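>>>>
>>>> for several of the above, a quick sanity check on one specific URL is
>>>> e.g.:
>>>>
>>>>   bin/nutch readdb crawl/crawldb -url http://www.example.edu/some.pdf
>>>>
>>>> which prints the crawldb status for that URL (crawl dir and URL are
>>>> placeholders, of course)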
>>>>
>>>> Piet
>>>>
>>>>
>>>> On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>
>>>>> I am crawling a large website, which is our university's. From the logs
>>>>> and some grep'ing, I see that some PDF files were not crawled. Why could
>>>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>>
