Hi Alexei,

Because users are lazy, some browsers automatically try to add the "www" (and other prefixes) to escape from a "server not found" error; see
http://www-archive.mozilla.org/docs/end-user/domain-guessing.html

Nutch does no domain guessing. The URLs have to be correct and the host name must be complete. Finally, even if test.com sends an HTTP redirect pointing to www.test.com: check your URL filters to make sure both hosts are accepted.

Sebastian

On 08/04/2012 05:33 PM, Mathijs Homminga wrote:
> What do you mean exactly with "it fails in the fetch phase"?
> Do you get an error?
> Does "test.com" exist?
> Does it perhaps redirect to "www.test.com"?
> ...
>
> Mathijs
>
> On Aug 4, 2012, at 17:11 , Alexei Korolev <[email protected]> wrote:
>
>> yes
>>
>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>
>>> http:// ?
>>>
>>> hth
>>>
>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <[email protected]> wrote:
>>>> Hello,
>>>>
>>>> I have a small script:
>>>>
>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>>>
>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>>> $NUTCH_PATH fetch $s1
>>>> $NUTCH_PATH parse $s1
>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>>>
>>>> In seed.txt I have just one site, for example "test.com". When I start
>>>> the script, it fails in the fetch phase.
>>>> If I change test.com to www.test.com, it works fine. It seems the reason
>>>> is that the outgoing links on test.com all have the www. prefix.
>>>> What do I need to change in the Nutch config to work with test.com?
>>>>
>>>> Thank you in advance. I hope my explanation is clear :)
>>>>
>>>> --
>>>> Alexei A. Korolev
>>>
>>> --
>>> Lewis
>>
>> --
>> Alexei A. Korolev
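(For reference, Sebastian's "check your URL filters" advice usually means inspecting `conf/regex-urlfilter.txt`. A minimal sketch of the two cases, assuming the stock regex filter plugin is in use; the exact patterns below are illustrative, not from the thread:)

```
# conf/regex-urlfilter.txt (sketch; your actual rules may differ)

# A rule like this accepts www.test.com but silently drops test.com,
# so a fetch of the bare host would produce nothing:
+^http://www\.test\.com/

# To accept both hosts, make the www. prefix optional instead:
+^http://(www\.)?test\.com/
```

(Rules are tried top to bottom; the first matching `+`/`-` rule decides whether the URL is kept.)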

