Did you set db.ignore.external.links in *conf/nutch-site.xml*?
This prevents external links from being fetched.
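If that property is set to true, all outlinks pointing to other hosts are dropped. A minimal sketch of the override in nutch-site.xml (the property name is the stock one; the description text is paraphrased):

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks that point to an external host are ignored.</description>
  </property>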
Another possible problem is that the robots.txt of those servers prevents
the crawler from fetching.
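You can check the rules by hand, for example with curl against one of the hosts from your fetch log (just an illustration, any of the hosts works):

  curl http://www.tarpits.org/robots.txt
  curl http://az.wikipedia.org/robots.txt

Look for Disallow rules that match the paths you are trying to fetch.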
You can check this with *bin/nutch readdb*. There you can see whether the
sites were really fetched.
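For example (a sketch; the exact readdb options differ between Nutch 1.x and 2.x, so run bin/nutch readdb without arguments to see the usage for your version):

  bin/nutch readdb -stats
  bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus

If a URL shows up with a fetched status but is still missing from Solr, the problem is in the indexing step rather than in fetching.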
regards
Stefan
On 18.08.2012 09:07, Robert Irribarren wrote:
I run this:
nutch inject urls
nutch generate
bin/nutch crawl urls -depth 3 -topN 100
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
echo Crawling completed
dir
Then I see a lot of URLs being fetched during the crawl phase.
When I run solrindex, it doesn't add all the URLs I see when it says
fetching:
54 URLs in 5 queues
fetching http://www.tarpits.org/join-us
fetching http://www.leonisadobemuseum.org/history-leonis.asp
fetching http://az.wikipedia.org/wiki/Quercus_prinus
It doesn't add the Wikipedia page or the others.
ADDITIONAL INFO:
My regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
#################################################################
ADDITIONAL INFO: Running Solr 4.0 and Nutch 2.0