Hello,

Thank you for your reply.

Here is my regex-urlfilter.txt:

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with a slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
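
As a sanity check (not Nutch code, just a rough Python sketch of the first-match-wins semantics, with the suffix rule abbreviated), both host variants from this thread pass these rules, so the regex filter itself shouldn't be what blocks test.com:

```python
import re

# Rules transcribed (and abbreviated) from regex-urlfilter.txt above.
# The real filter runs in Java; this only illustrates the semantics.
RULES = [
    ('-', r'^(file|ftp|mailto):'),                      # skip non-http schemes
    ('-', r'\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|exe|EXE|js|JS)$'),  # abbreviated suffix list
    ('-', r'[?*!@=]'),                                  # probable queries
    ('-', r'.*(/[^/]+)/[^/]+\1/[^/]+\1/'),              # repeated path segments
    ('+', r'.'),                                        # accept anything else
]

def accepts(url):
    """First matching rule decides; no match means the URL is dropped."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == '+'
    return False

print(accepts('http://test.com/'))       # True
print(accepts('http://www.test.com/'))   # True
print(accepts('http://test.com/a?b=1'))  # False (query characters)
```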


and prefix-urlfilter.txt:

# config file for urlfilter-prefix plugin

http://
https://
ftp://
file://
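
The prefix filter only checks the scheme, so it treats test.com and www.test.com the same. A minimal sketch of that behaviour (again just illustrative Python, not the plugin itself):

```python
# A URL passes the prefix filter only if it starts with a listed prefix.
PREFIXES = ('http://', 'https://', 'ftp://', 'file://')

def prefix_accepts(url):
    return url.startswith(PREFIXES)

print(prefix_accepts('http://test.com/'))  # True
print(prefix_accepts('www.test.com'))      # False: scheme missing
```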


It all looks fine to me. Right?

On Sat, Aug 4, 2012 at 11:16 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Alexei,
>
> Because users are lazy some browser automatically
> try to add the www (and other stuff) to escape from
> a "server not found" error, see
> http://www-archive.mozilla.org/docs/end-user/domain-guessing.html
>
> Nutch does no domain guessing. The urls have to be correct
> and the host name must be complete.
>
> Finally, even if test.com sends a HTTP redirect pointing
> to www.test.com : check your URL filters whether both
> hosts are accepted.
>
> Sebastian
>
> On 08/04/2012 05:33 PM, Mathijs Homminga wrote:
> > What do you mean exactly with "it falls on fetch phase"?
> > Do you get an error?
> > Does "test.com" exist?
> > Does it perhaps redirect to "www.test.com"?
> > ...
> >
> > Mathijs
> >
> > On Aug 4, 2012, at 17:11, Alexei Korolev <[email protected]> wrote:
> >
> >> yes
> >>
> >> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >>
> >>> http://   ?
> >>>
> >>> hth
> >>>
> >>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <[email protected]> wrote:
> >>>> Hello,
> >>>>
> >>>> I have small script
> >>>>
> >>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>
> >>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>> $NUTCH_PATH fetch $s1
> >>>> $NUTCH_PATH parse $s1
> >>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>
> >>>> In seed.txt I have just one site, for example "test.com". When I
> >>>> start the script it falls on the fetch phase.
> >>>> If I change test.com to www.test.com it works fine. It seems the
> >>>> reason is that outgoing links on test.com all have the www. prefix.
> >>>> What do I need to change in the nutch config to work with test.com?
> >>>>
> >>>> Thank you in advance. I hope my explanation is clear :)
> >>>>
> >>>> --
> >>>> Alexei A. Korolev
> >>>
> >>>
> >>>
> >>> --
> >>> Lewis
> >>>
> >>
> >>
> >>
> >> --
> >> Alexei A. Korolev
> >
>
>


-- 
Alexei A. Korolev
