Re: crawling site without www

Mathijs Homminga Sat, 04 Aug 2012 08:33:57 -0700

What exactly do you mean by "it falls on fetch phase"?
Do you get an error?
Does "test.com" exist?
Does it perhaps redirect to "www.test.com"?
...
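
One quick way to answer the redirect question outside of Nutch is to look at the HTTP response headers directly. The sketch below is only an illustration: it assumes curl is installed, the helper names redirect_target and check_redirect are made up for this example, and "test.com" is the placeholder host from the original question.

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers on stdin and print
# the value of the Location header (prints nothing if there is none).
redirect_target() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Hypothetical helper: fetch only the headers of a URL (curl -I sends a
# HEAD request, -s silences progress output) and print the redirect target.
check_redirect() {
    curl -sI "$1" | redirect_target
}

# Usage (needs network access; test.com is a placeholder):
#   check_redirect http://test.com/
```

If this prints a www. URL, the bare host is redirecting, and the next thing to check is how the fetcher handles that redirect. If it prints nothing, the problem is probably elsewhere, and the fetcher's error message would be the place to look.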


Mathijs


On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com> wrote:

> yes
> 
> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
> 
>> http://   ?
>> 
>> hth
>> 
>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <alexei.koro...@gmail.com>
>> wrote:
>>> Hello,
>>> 
>>> I have a small script:
>>> 
>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>> 
>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>> $NUTCH_PATH fetch $s1
>>> $NUTCH_PATH parse $s1
>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>> 
>>> In seed.txt I have just one site, for example "test.com". When I start
>>> script it falls on fetch phase.
>>> If I change test.com to www.test.com it works fine. The reason seems to
>>> be that the outgoing links on test.com all have the www. prefix.
>>> What do I need to change in the Nutch config to make it work with test.com?
>>> 
>>> Thank you in advance. I hope my explanation is clear :)
>>> 
>>> --
>>> Alexei A. Korolev
>> 
>> 
>> 
>> --
>> Lewis
>> 
> 
> 
> 
> -- 
> Alexei A. Korolev
