On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:
> - I don't fully understand the use of the topN parameter. Should I
> increase it?
>
yes
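
topN caps how many of the top-scoring URLs are fetched in each round, so
with -topN 5 each of your 100 rounds fetches at most 5 pages, which can
easily leave the PDFs out. As a rough sketch (urls/crawl are just the
tutorial's directory names, and 1000 is only an example value, adjust both
to your setup), something like

  bin/nutch crawl urls -dir crawl -depth 100 -topN 1000

lets far more of the site into each fetch list.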

> - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>
good, should work then

> - I looked for the link, it was there. Besides, that was for another
> website I was experimenting on.
> - How do I check segments?
>
e.g. with segmentreader, a Hadoop access command built into Nutch
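
For example, to check whether the PDF URLs actually ended up in a segment
(the segment name below is only a placeholder, use one of the timestamped
directories under crawl/segments), something along the lines of

  bin/nutch readseg -list -dir crawl/segments
  bin/nutch readseg -dump crawl/segments/20120524103000 segdump -nocontent -noparsetext
  grep -i '\.pdf' segdump/dump

The -list call prints per-segment generated/fetched/parsed counts, and the
dump writes a plain-text file (segdump/dump here) that you can grep for the
PDF URLs.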

> - I didn't check filenames, but I've tried searching for a word in that
> PDF file.
>
then the reason could also be indexing, not the crawl itself: searching for
a word only tests what made it into the index

> - I've got more than 50gb free.
> - I'm not sure about webserver kicking me off, I'll have to check that
> with the sysadmin.
>
should be visible as something like timeouts or a similar message in the
hadoop logs
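
Assuming the default log setup (logs/hadoop.log under the Nutch install),
a quick way to look is something like

  grep -iE 'timeout|refused|exception' logs/hadoop.log | less

checked around the time window of the crawl.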

>
> Regards,
>
>
> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>
>> - your topN parameter limited the crawl: see the info at
>> http://wiki.apache.org/nutch/NutchTutorial
>>
>> or:
>>
>> - file filters
>> - there is no link to the files (as you suggested yourself already)
>> - did you check the correct/all segments?
>> - did you check the fully correct filenames? wildcards don't work on all
>> segmentreader approaches
>> - size limits of the crawler (see previous discussion)
>> - did you check file presence in the segment, or parse result? i.e.
>> parsing could have failed (cfr the previous discussion of the last few
>> days)
>> - your disk got full and crawling stopped
>> - the webserver(s) kicked you off
>> - your hadoop logs have overrun the local disk on which the crawler was
>> running (i.e. disk full)
>>
>> Piet
>>
>>
>> On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am crawling a large website, which is our university's. From the logs
>>> and some grep'ing, I see that some PDF files were not crawled. Why could
>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>
>>> Regards,
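
P.S. On the "size limits of the crawler" point in the list above: if the
missing PDFs are large, it's worth checking the http.content.limit property,
since the fetcher truncates anything bigger than that and a truncated PDF
will usually fail to parse. Something like

  grep -A1 'http.content.limit' conf/nutch-default.xml conf/nutch-site.xml

shows what your install is using; raise it in conf/nutch-site.xml if needed.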