Please see comments below

On Wed, Jan 11, 2012 at 2:25 PM, jepse <j...@jepse.net> wrote:

> I'm using this command to crawl:
> bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3
>
> I don't have a crawl-urlfilter.txt. I'm using regex-urlfilter.txt with the
> following content:
>

Correct, crawl-urlfilter.txt was deprecated in a recent release of Nutch.
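
For reference, recent releases point the regex filter plugin at
regex-urlfilter.txt through the urlfilter.regex.file property, roughly like
this (a sketch based on the stock nutch-default.xml, so check the copy that
ships with your release):

  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter.txt</value>
  </property>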


> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> -[*!@]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +./
>

Try explicitly adding the domains you wish to crawl to your
regex-urlfilter.txt, e.g.

+^http://([a-z0-9]*\.)*lequipe.fr/
+^http://([a-z0-9]*\.)*ostsee-zitung.de/
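
Note that the rules in regex-urlfilter.txt are applied top to bottom: the
first matching pattern decides whether a URL is accepted or rejected, and a
URL that matches no rule at all is ignored. So, as a rough sketch (reusing the
example domains above), the tail of the file could look like this if you only
want those two sites crawled:

# accept only the sites we care about
+^http://([a-z0-9]*\.)*lequipe.fr/
+^http://([a-z0-9]*\.)*ostsee-zitung.de/

# no final catch-all rule, so everything else matches nothing and is ignored

If you would rather keep crawling everything else as well, just put the two
"+" lines above your existing catch-all ("+./" in your file) instead.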



>
> I simply removed "?" and "=" from the characters to skip.
>
> Have you tried to crawl those urls?
>
>



-- 
*Lewis*
