Nutch not crawling links inside RSS Feeds

Ankit Goel Mon, 25 May 2015 19:16:10 -0700

Hi,
I'm using Nutch 1.9 with Solr 4.9.
The plugin for crawling RSS feeds ships with the binary, as noted on the
site, but I have found almost no clear documentation on how to activate or
include that plugin, or on whether it is already active.
When I run a crawl with an RSS feed as the seed (
http://timesofindia.indiatimes.com/rssfeedsdefault.cms), the seed is
parsed, but none of the links in it are crawled and the process just ends.
The regex urlfilter is set to permit all links from that site:
+^http://timesofindia.com
+^http://timesofindia.indiatimes.com
+^http://timesofindia.indiatimes.com/rssfeedsdefault.cms
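
To sanity-check these three rules outside Nutch, here is a minimal sketch in Python. It assumes, as I understand the regex urlfilter to work, that rules are tried top to bottom, that the first pattern found anywhere in the URL decides, that '+' accepts and '-' rejects, and that a URL matching no rule is rejected. The function names are made up for this illustration and are not part of Nutch.

```python
import re

# The three rules exactly as quoted in this message.
RULES = [
    "+^http://timesofindia.com",
    "+^http://timesofindia.indiatimes.com",
    "+^http://timesofindia.indiatimes.com/rssfeedsdefault.cms",
]


def first_matching_rule(url, rules=RULES):
    """Return (index, sign) of the first rule whose pattern is found in url.

    Returns None when no rule matches. Assumption: first match wins and
    the pattern is searched for (not anchored) unless it anchors itself.
    """
    for index, rule in enumerate(rules):
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return index, sign
    return None


def accepted(url, rules=RULES):
    """True if the first matching rule is a '+' rule; unmatched URLs are rejected."""
    hit = first_matching_rule(url, rules)
    return hit is not None and hit[1] == "+"


if __name__ == "__main__":
    seed = "http://timesofindia.indiatimes.com/rssfeedsdefault.cms"
    print(seed, first_matching_rule(seed), accepted(seed))
```

Under those assumptions the seed is already accepted by the second rule, so the third rule never comes into play, and any link in the feed that uses https or a different host would be rejected.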


But I am not sure whether any changes need to be made to nutch-site.xml and
parse-plugins.xml. The current nutch-site.xml has:

<name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

parse-plugins.xml is unchanged, as shipped. Parts of it are as follows:

<mimeType name="application/rss+xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
</mimeType>

<mimeType name="text/html">
    <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
</mimeType>

<mimeType name="text/xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
</mimeType>
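
One way to check whether the plugin ids named in parse-plugins.xml are covered by the plugin.includes value above is a small sketch like the following. It assumes that Nutch treats plugin.includes as a regular expression that must match a plugin id in full; the helper name is_included is made up for this illustration.

```python
import re

# plugin.includes value exactly as quoted from nutch-site.xml in this message.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags)|"
    "index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """True if the whole plugin id matches the plugin.includes expression.

    Assumption: the id has to match in full, not just contain a match.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # The two plugin ids listed for application/rss+xml and text/xml.
    for plugin_id in ("parse-tika", "feed"):
        print(plugin_id, is_included(plugin_id))
```

With the value quoted above, and under that assumption, parse-tika matches the expression and feed does not.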


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
