I actually didn't have it specified. I've now put this in nutch-site.xml;
it looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>Balsa Crawler</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available: ..
  </description>
</property>

</configuration>
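Since storage.data.store.class points at the Gora SqlStore, conf/gora.properties
has to name a matching JDBC backend. A minimal sketch, assuming the HSQLDB setup
from the Nutch 2.x tutorial (driver, URL, user, and password are placeholders
for a local instance):

# JDBC backend for the Gora SqlStore (placeholder values)
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=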


On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:

> Did you set db.ignore.external.links in *conf/nutch-site.xml*?
> This prevents external links from being fetched.
> Another problem could be that the servers' robots.txt prevents the
> crawler from fetching.
> You can check this with *bin/nutch readdb*. There you can see whether
> the sites were really fetched.
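> For example (a sketch, assuming the Nutch 2.x WebTableReader options):
>
>   bin/nutch readdb -stats
>   bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus
>
> -stats prints how many pages are in each status, and -url dumps the
> stored record for a single URL, including its fetch status.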
> Regards,
> Stefan
>
> On 18.08.2012 09:07, Robert Irribarren wrote:
>
>> I run this:
>> nutch inject urls
>> nutch generate
>> bin/nutch crawl urls -depth 3 -topN 100
>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>> echo Crawling completed
>> dir
>>
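>> For comparison, here is the same run broken into the individual batch
>> steps (a sketch, assuming the Nutch 2.0 batch-id workflow; <batchId> is
>> the id that generate prints):
>>
>>   bin/nutch inject urls
>>   bin/nutch generate -topN 100
>>   bin/nutch fetch <batchId>
>>   bin/nutch parse <batchId>
>>   bin/nutch updatedb
>>   bin/nutch solrindex http://127.0.0.1:8983/solr/ <batchId>
>>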
>> Then I see a lot of URLs being fetched during the crawl phase.
>> When I run solrindex, it doesn't add all the URLs I see when it says
>> fetching:
>>
>> 54 URLs in 5 queues
>> fetching http://www.tarpits.org/join-us
>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>
>> It doesn't add the Wikipedia page, nor the others.
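>>
>> One way to check what actually reached Solr (a sketch, assuming the
>> default select handler on the Solr 4.0 instance above):
>>
>>   curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0&wt=json'
>>
>> The numFound field in the response is the number of documents in the
>> index, which can be compared against the number of fetched pages.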
>>
>> ADDITIONAL INFO:
>> My regex-urlfilter.txt:
>> # skip file: ftp: and mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> # for a more extensive coverage use the urlfilter-suffix plugin
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept anything else
>> +.
>> #################################################################
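>>
>> To see which of these rules a given URL trips, the whole filter chain
>> can be tested from the command line (a sketch, assuming the
>> URLFilterChecker tool shipped with Nutch; it reads URLs on stdin and
>> prints + for accepted and - for rejected):
>>
>>   echo 'http://az.wikipedia.org/wiki/Quercus_prinus' | \
>>     bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined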
>>
>> ADDITIONAL INFO: Running on Solr 4.0 and Nutch 2.0.
>>
>>
>
