Hello, thank you for your reply.
Here is my regex-urlfilter.txt:

# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

and prefix-urlfilter.txt:

# config file for urlfilter-prefix plugin
http://
https://
ftp://
file://

Looks all fine to me. Right?

On Sat, Aug 4, 2012 at 11:16 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Alexei,
>
> Because users are lazy, some browsers automatically
> try to add the www (and other stuff) to escape from
> a "server not found" error; see
> http://www-archive.mozilla.org/docs/end-user/domain-guessing.html
>
> Nutch does no domain guessing. The URLs have to be correct
> and the host name must be complete.
>
> Finally, even if test.com sends an HTTP redirect pointing
> to www.test.com: check your URL filters to see whether both
> hosts are accepted.
>
> Sebastian
>
> On 08/04/2012 05:33 PM, Mathijs Homminga wrote:
> > What do you mean exactly by "it fails in the fetch phase"?
> > Do you get an error?
> > Does "test.com" exist?
> > Does it perhaps redirect to "www.test.com"?
> > ...
> >
> > Mathijs
> >
> > On Aug 4, 2012, at 17:11, Alexei Korolev <[email protected]> wrote:
> >
> >> yes
> >>
> >> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >>
> >>> http:// ?
> >>>
> >>> hth
> >>>
> >>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <[email protected]> wrote:
> >>>> Hello,
> >>>>
> >>>> I have a small script:
> >>>>
> >>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>
> >>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>> $NUTCH_PATH fetch $s1
> >>>> $NUTCH_PATH parse $s1
> >>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>
> >>>> In seed.txt I have just one site, for example "test.com". When I start
> >>>> the script, it fails in the fetch phase.
> >>>> If I change test.com to www.test.com, it works fine. It seems the reason
> >>>> is that the outgoing links on test.com all have the www. prefix.
> >>>> What do I need to change in the Nutch config to make it work with test.com?
> >>>>
> >>>> Thank you in advance. I hope my explanation is clear :)
> >>>>
> >>>> --
> >>>> Alexei A. Korolev
> >>>
> >>> --
> >>> Lewis
> >>
> >> --
> >> Alexei A. Korolev

--
Alexei A. Korolev
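P.S. By the way, to double-check Sebastian's last point, I think both hosts can be fed through the configured filters with Nutch's URLFilterChecker. This is just a sketch on my side; it assumes a Nutch 1.x install where the nutch script (my $NUTCH_PATH above) can invoke a class by name:

  # run every configured URL filter over both host variants;
  # URLFilterChecker reads URLs from stdin and echoes them back
  # prefixed with '+' (accepted) or '-' (rejected)
  printf 'http://test.com/\nhttp://www.test.com/\n' | \
    $NUTCH_PATH org.apache.nutch.net.URLFilterChecker -allCombined

If http://test.com/ comes back with a leading '-', the filters are dropping it before the fetch phase ever sees it. And to check whether test.com really redirects to www.test.com, as Mathijs suspected, a plain curl shows the status line and any Location header:

  curl -I http://test.com/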

