I run this:

    nutch inject urls
    nutch generate
    bin/nutch crawl urls -depth 3 -topN 100
    bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
    echo Crawling completed
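As I understand it, bin/nutch crawl is just a convenience wrapper around the individual Nutch jobs, roughly like this (exact flags vary between Nutch 2.x versions, so treat this as a sketch, not the literal expansion):

    bin/nutch inject urls                                      # seed the webtable from urls/
    bin/nutch generate -topN 100                               # select a batch of URLs to fetch
    bin/nutch fetch -all                                       # fetch the generated batch
    bin/nutch parse -all                                       # parse the fetched pages
    bin/nutch updatedb                                         # fold parse results back into the webtable
    bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex   # push parsed docs to Solr

If that picture is right, a URL only reaches solrindex after it has made it through parse and updatedb, not merely through fetch.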
Then I see a lot of URLs being fetched during the crawl phase. But when I run solrindex, it doesn't add all the URLs I see in the fetch output, e.g.:

    fetching 54 URLs in 5 queues
    fetching http://www.tarpits.org/join-us
    fetching http://www.leonisadobemuseum.org/history-leonis.asp
    fetching http://az.wikipedia.org/wiki/Quercus_prinus

It doesn't add the Wikipedia page, nor the others.

ADDITIONAL INFO : My regex-urlfilter.txt:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    +.

ADDITIONAL INFO : Running on Solr 4.0 and Nutch 2.0.
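To rule the URL filter out, here is a quick sanity check I can run (my own throwaway script, not part of Nutch) that pushes the fetched URLs through the same patterns with grep -E. The slash-repeat loop rule is left out because POSIX ERE has no backreferences:

    # combined reject patterns from regex-urlfilter.txt (minus the loop rule)
    filters='^(file|ftp|mailto):|\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$|[?*!@=]'

    for url in \
      'http://www.tarpits.org/join-us' \
      'http://www.leonisadobemuseum.org/history-leonis.asp' \
      'http://az.wikipedia.org/wiki/Quercus_prinus'
    do
      if printf '%s\n' "$url" | grep -qE "$filters"; then
        echo "rejected by filter: $url"
      else
        echo "accepted:           $url"
      fi
    done

All three URLs come out accepted, so regex-urlfilter.txt doesn't seem to be what's dropping them.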

