Please see comments below.

On Wed, Jan 11, 2012 at 2:25 PM, jepse <j...@jepse.net> wrote:
> I'm using that command to crawl:
>
> bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3
>
> I don't have a crawl-urlfilter.txt. I'm using the regex-urlfilter.txt with
> the following content:

Correct, crawl-urlfilter.txt was deprecated in a recent release of Nutch.

> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> -[*!@]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +./

Try explicitly adding the domains you wish to crawl to your
regex-urlfilter.txt, e.g.

+^http://([a-z0-9]*\.)*lequipe.fr/
+^http://([a-z0-9]*\.)*ostsee-zitung.de/

> I simply removed "?=" as skipping chars.
>
> Have you tried to crawl those urls?

--
*Lewis*
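P.S. To sanity-check which URLs your filters actually accept before
kicking off another crawl, you can run the seed list through Nutch's
filter checker. A minimal sketch, assuming your release ships
org.apache.nutch.net.URLFilterChecker and your seed list lives at
urls/seed.txt (adjust both to your setup):

# pipe seed URLs through the whole configured filter chain;
# accepted URLs come back prefixed with "+", rejected ones with "-"
cat urls/seed.txt | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If http://www.lequipe.fr/ comes back with a "-", it is the URL filters
blocking the crawl rather than fetching or parsing.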