not crawling external links

Shameema Umer Thu, 07 Jun 2012 23:40:06 -0700

My *nutch-default.xml* contains:

<property>
 <name>db.ignore.external.links</name>
 <value>false</value>
 <description>If true, outlinks leading from a page to external hosts
 will be ignored. This is an effective way to limit the crawl to include
 only initially injected hosts, without creating complex URLFilters.
 </description>
</property>


*seed*
http://feeds.bbci.co.uk/news/business/rss.xml

*regex url filter*

+^http://([a-z0-9]*\.)*feeds.bbci.co.uk/news/business/rss.xml
+^http://([a-z0-9]*\.)*www.bbc.co.uk/news/

*Crawl*
$ bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 2 -topN 70


The crawl does not fetch any www.bbc.co.uk/news pages, even though all links
in http://feeds.bbci.co.uk/news/business/rss.xml point to
www.bbc.co.uk/news. Please let me know where I am going wrong.
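
One way to rule out the URL filter as the cause is to test the two rules
outside Nutch. The following is only a rough sketch in Python, not Nutch's
own Java code. It assumes the regex URL filter applies the first rule that
matches anywhere in the URL and rejects URLs that match no rule. The
function name accepts() and the article URL are made up for illustration;
the two patterns are copied verbatim from the filter above.

```python
import re

# The two rules from the regex url filter above, copied verbatim.
# "+" means accept; a "-" rule would mean reject.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*feeds.bbci.co.uk/news/business/rss.xml"),
    ("+", r"^http://([a-z0-9]*\.)*www.bbc.co.uk/news/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a "+" rule.

    Assumption: rules are tried in order, the first match decides,
    and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "http://feeds.bbci.co.uk/news/business/rss.xml",  # the seed
        "http://www.bbc.co.uk/news/business-00000000",    # hypothetical article URL
        "http://example.com/",                            # unrelated host
    ]
    for url in samples:
        print(url, accepts(url))
```

If the sample www.bbc.co.uk/news URL is accepted here, the filter rules are
probably not what is blocking the outlinks, and the cause would have to be
looked for elsewhere in the configuration.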


Thanks in advance.
Shameema
