Hi Alexei,

Because users are lazy, some browsers automatically try to add the "www" (and other prefixes) to escape from a "server not found" error; see
http://www-archive.mozilla.org/docs/end-user/domain-guessing.html

Nutch does no domain guessing. The URLs have to be correct and the host name must be complete. Finally, even if test.com sends an HTTP redirect pointing to www.test.com: check your URL filters to make sure both hosts are accepted.

Sebastian

On 08/04/2012 05:33 PM, Mathijs Homminga wrote:
> What do you mean exactly with "it fails in the fetch phase"?
> Do you get an error?
> Does "test.com" exist?
> Does it perhaps redirect to "www.test.com"?
> ...
>
> Mathijs
>
> On Aug 4, 2012, at 17:11 , Alexei Korolev <[email protected]> wrote:
>
>> yes
>>
>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>
>>> http:// ?
>>>
>>> hth
>>>
>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <[email protected]> wrote:
>>>> Hello,
>>>>
>>>> I have a small script:
>>>>
>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>>>
>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>>> $NUTCH_PATH fetch $s1
>>>> $NUTCH_PATH parse $s1
>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>>>
>>>> In seed.txt I have just one site, for example "test.com". When I start
>>>> the script, it fails in the fetch phase.
>>>> If I change test.com to www.test.com, it works fine. It seems the reason
>>>> is that the outgoing links on test.com all have the www. prefix.
>>>> What do I need to change in the Nutch config to work with test.com?
>>>>
>>>> Thank you in advance. I hope my explanation is clear :)
>>>>
>>>> --
>>>> Alexei A. Korolev
>>>
>>> --
>>> Lewis
>>
>> --
>> Alexei A. Korolev
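(For reference, Sebastian's "check your URL filters" advice usually means inspecting `conf/regex-urlfilter.txt`. A minimal sketch of the two cases, assuming the stock regex filter plugin is in use; the exact patterns below are illustrative, not from the thread:)

```
# conf/regex-urlfilter.txt (sketch; your actual rules may differ)

# A rule like this accepts www.test.com but silently drops test.com,
# so a fetch of the bare host would produce nothing:
+^http://www\.test\.com/

# To accept both hosts, make the www. prefix optional instead:
+^http://(www\.)?test\.com/
```

(Rules are tried top to bottom; the first matching `+`/`-` rule decides whether the URL is kept.)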

