I actually didn't have it specified. I've now put this in nutch-site.xml, which looks like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Balsa Crawler</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
    <description>The Gora DataStore class for storing and retrieving data.
    Currently the following stores are available: ..
    </description>
  </property>
</configuration>

On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:

> Did you set db.ignore.external.links in *conf/nutch-site.xml*?
> This prevents external links from being fetched.
> Another problem could be that the robots.txt of the servers prevents the
> crawler from fetching. You can check this with *bin/nutch readdb*; there
> you can see whether the pages were really fetched.
> Regards,
> Stefan
>
> On 18.08.2012 09:07, Robert Irribarren wrote:
>
>> I run this:
>>
>> nutch inject urls
>> nutch generate
>> bin/nutch crawl urls -depth 3 -topN 100
>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>> echo Crawling completed
>> dir
>>
>> Then I see a lot of URLs being fetched during the crawl phase. But when
>> I run solrindex, it doesn't add all the URLs I saw being fetched:
>>
>> 54 URLs in 5 queues
>> fetching http://www.tarpits.org/join-us
>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>
>> It doesn't add the Wikipedia page or the others.
>>
>> ADDITIONAL INFO: my regex-urlfilter.txt
>>
>> # skip file: ftp: and mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> # for a more extensive coverage use the urlfilter-suffix plugin
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept anything else
>> +.
>>
>> ADDITIONAL INFO: running Solr 4.0 and Nutch 2.0
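To follow up on Stefan's readdb suggestion, this is roughly how the fetch
status can be checked. A minimal sketch, assuming the Nutch 2.0 readdb tool
(WebTableReader) and its -stats/-url/-dump options; run bin/nutch readdb
with no arguments to confirm the exact usage on your build:

    # overall counts by status (fetched, unfetched, gone, ...)
    bin/nutch readdb -stats

    # inspect one record, e.g. the Wikipedia page that never reaches Solr;
    # its status field shows whether it was actually fetched and parsed
    bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus

    # dump the whole webtable to a local directory for grepping
    bin/nutch readdb -dump webtable-dump

If a page shows up as fetched but still isn't indexed, the parse and index
steps are the next place to look; if it is missing from the table entirely,
a URL filter or a robots.txt rule is the more likely cause.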

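Stefan's other suggestion, the robots.txt check, can also be done outside
Nutch entirely. A sketch using curl (any HTTP client works); the hosts below
are just the ones from the fetch log above:

    # look for Disallow rules that match the fetched paths
    curl -s http://az.wikipedia.org/robots.txt | grep -i disallow
    curl -s http://www.tarpits.org/robots.txt | grep -i disallow

If a Disallow rule covers a page, Nutch honors it and silently skips the
page, so it never reaches the index.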
