That could be it indeed. I googled it for you; this is the first hit searching for "nutch crawl query pages":
http://stackoverflow.com/questions/7045716/nutch-1-2-why-wont-nutch-crawl-url-with-query-strings
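
In a stock Nutch 1.x setup the usual culprit is the URL filter: conf/regex-urlfilter.txt (and conf/crawl-urlfilter.txt, which the one-step crawl command reads in some 1.x releases) ships with a rule that drops any URL containing query-string characters. A minimal sketch of the change, assuming the stock filter file:

    # default rule that skips probable query URLs -- comment it out,
    # or narrow it so '?' and '=' are allowed through:
    # -[?*!@=]
    -[*!@]

    # keep the usual catch-all accept rule at the end of the file
    +.

Links rejected by the old rule are not in the crawldb, so they are only picked up in later crawl rounds (or a fresh crawl) once the filter lets them through.
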
On Thu, May 24, 2012 at 1:52 PM, Tolga <[email protected]> wrote:

> I might have figured out why. Our website has a lot of query strings in
> addresses. One example is
> http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html.
> Could this be why? If that's the case, how do I crawl it?
>
> Regards,
>
>
> On 5/24/12 11:28 AM, Piet van Remortel wrote:
>
>> I googled for you:
>>
>> "Typically one starts testing one's configuration by crawling at shallow
>> depths, sharply limiting the number of pages fetched at each level (-topN),
>> and watching the output to check that desired pages are fetched and
>> undesirable pages are not. Once one is confident of the configuration, then
>> an appropriate depth for a full crawl is around 10. The number of pages per
>> level (-topN) for a full crawl can be from tens of thousands to millions,
>> depending on your resources."
>>
>> Also, as the nutch documentation shows, the topN parameter is optional.
>>
>> Can I respectfully suggest that you go through the basic information that
>> is available online to get familiar with Nutch. Copying the online
>> information into this mailing list is not helping anybody.
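
To make the tutorial guidance quoted above concrete, with the one-step crawl command shipped in Nutch 1.x it would look roughly like this (directory names and counts are placeholders, not recommendations):

    # shallow sanity-check crawl: few levels, few pages per level
    bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50

    # fuller crawl once the configuration looks right
    bin/nutch crawl urls -dir crawl-full -depth 10 -topN 100000

-topN caps how many of the highest-scoring URLs are fetched in each round, so a crawl with -topN 5, as in the original mail further below, fetches at most 5 pages per level regardless of depth.
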
>> On Thu, May 24, 2012 at 10:19 AM, Tolga <[email protected]> wrote:
>>
>>> On 5/24/12 11:00 AM, Piet van Remortel wrote:
>>>
>>>> On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:
>>>>
>>>>> - I don't fully understand the use of the topN parameter. Should I
>>>>> increase it?
>>>>
>>>> yes
>>>
>>> What would a sensible topN value be for a large university website?
>>>
>>>>> - You mean the parse-pdf thing? I've got that in my nutch-default.xml.
>>>>
>>>> good, should work then
>>>
>>>>> - I looked for the link, it was there. Besides, that was for another
>>>>> website I was experimenting on.
>>>>> - How do I check segments?
>>>>
>>>> e.g. with segmentreader, a hadoop access command built into nutch
>>>
>>>>> - I didn't check filenames, but I've tried searching for a word in that
>>>>> PDF file.
>>>>
>>>> then the reason could also be indexing
>>>
>>>>> - I've got more than 50 GB free.
>>>>> - I'm not sure about the webserver kicking me off; I'll have to check
>>>>> that with the sysadmin.
>>>>
>>>> should be visible as something like timeouts or a similar message in the
>>>> hadoop logs
>>>
>>>>> Regards,
>>>>>
>>>>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>>>>
>>>>>> - your topN parameter limited the crawl: see the info at
>>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>>
>>>>>> or:
>>>>>>
>>>>>> - file filters
>>>>>> - there is no link to the files (as you suggested yourself already)
>>>>>> - did you check the correct/all segments?
>>>>>> - did you check the fully correct filenames? wildcards don't work on
>>>>>> all segmentreader approaches
>>>>>> - size limits of the crawler (see previous discussion)
>>>>>> - did you check file presence in the segment, or the parse result?
>>>>>> i.e. parsing could have failed (cf. the previous discussion of the
>>>>>> last few days)
>>>>>> - your disk got full and crawling stopped
>>>>>> - the webserver(s) kicked you off
>>>>>> - your hadoop logs have overrun the local disk on which the crawler
>>>>>> was running (i.e. disk full)
>>>>>>
>>>>>> Piet
>>>>>>
>>>>>> On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am crawling a large website, which is our university's. From the
>>>>>>> logs and some grep'ing, I see that some pdf files were not crawled.
>>>>>>> Why could this happen? I'm crawling with -depth 100 -topN 5.
>>>>>>>
>>>>>>> Regards,
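
On the "how do I check segments" question quoted above: the segmentreader is exposed as the readseg command in Nutch 1.x. A rough sketch, assuming a stock layout (the segment timestamp and the PDF name are placeholders):

    # see what a segment contains
    bin/nutch readseg -list crawl/segments/20120524103000

    # dump the segment to plain text and grep for the exact URL of the PDF
    # (exact string match, no wildcards)
    bin/nutch readseg -dump crawl/segments/20120524103000 segdump
    grep 'some-missing-file.pdf' segdump/dump

If the URL never shows up in any segment, it was not fetched at all (filters, topN, or no link to it); if it shows up but with a parse failure, the problem is on the parsing or indexing side instead, as suggested in the replies above.
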

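Two configuration details touched on in the thread, sketched here with illustrative values rather than anything taken from the mails: local overrides normally belong in conf/nutch-site.xml rather than nutch-default.xml, and in 1.x releases of this vintage PDF parsing goes through parse-tika, so plugin.includes has to cover it. Large PDFs are also silently truncated at http.content.limit (64 kB by default), and a truncated PDF usually fails to parse.

    <!-- inside the <configuration> element of conf/nutch-site.xml -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

    <property>
      <name>http.content.limit</name>
      <!-- bytes; -1 disables the limit so large PDFs are fetched whole -->
      <value>-1</value>
    </property>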
