Hi,
I'm using Nutch 1.9 with Solr 4.9.
The plugin for crawling RSS feeds ships with the binary distribution, as
noted on the site, but I have found almost no clear documentation on how to
activate/include that plugin, or on whether it is already activated.
When I run a crawl with an RSS feed as the seed (
http://timesofindia.indiatimes.com/rssfeedsdefault.cms), the seed is
parsed, but none of its links are crawled and the process just ends. The
regex urlfilter is set to permit all links from that site:
+^http://timesofindia.com
+^http://timesofindia.indiatimes.com
+^http://timesofindia.indiatimes.com/rssfeedsdefault.cms
But I am not sure whether any changes need to be made to nutch-site.xml and
parse-plugins.xml. The current nutch-site.xml has:
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
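My untested guess is that, if the plugin is not active by default, it would have to be added to plugin.includes. The sketch below assumes the plugin id is "feed" (the id used in the parse-plugins file); I have not confirmed that this is the correct way to enable it:

```xml
<property>
  <name>plugin.includes</name>
  <!-- Same value as above, with the assumed plugin id "feed" added. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|feed|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```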
parse-plugins.xml is unchanged from the shipped version. The relevant parts are as follows:
<mimeType name="application/rss+xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
<mimeType name="text/html">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="text/xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
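Since parse-tika is listed before feed for both feed MIME types, I also wonder whether the order matters. Assuming the plugins are tried in the order listed (I have not verified this), a variant that prefers the feed parser for RSS might look like this:

```xml
<mimeType name="application/rss+xml">
  <!-- Assumption: the first plugin listed is tried first. -->
  <plugin id="feed" />
  <plugin id="parse-tika" />
</mimeType>
```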
--
Regards,
Ankit Goel
http://about.me/ankitgoel