I don’t think you need to modify parse-plugins.xml, because Tika (the default parser) can handle RSS feeds [1]. Also, running the parse checker against the URL you provided, on the default Nutch distribution without any changes, gives me the following output:

$ bin/nutch parsechecker http://timesofindia.indiatimes.com/rssfeedsdefault.cms
fetching: http://timesofindia.indiatimes.com/rssfeedsdefault.cms
parsing: http://timesofindia.indiatimes.com/rssfeedsdefault.cms
contentType: application/rss+xml
signature: e277b1d141680fe4afdc68dfb591503b
---------
Url
---------------
http://timesofindia.indiatimes.com/rssfeedsdefault.cms
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: The Times of India
Outlinks: 18
  outlink: toUrl: http://timesofindia.indiatimes.com/india/TOI-rating-With-77-5-Modi-govt-gets-distinction-in-its-first-year/articleshow/47422949.cms anchor: TOI rating: With 77.5%, Modi govt gets distinction in its first year
  outlink: toUrl: http://timesofindia.indiatimes.com/india/Black-money-Switzerland-discloses-names-of-two-Indians/articleshow/47420853.cms anchor: Black money: Switzerland discloses 2 names
  outlink: toUrl: http://timesofindia.indiatimes.com/india/Ex-PM-Manmohan-Singh-told-me-to-go-along-on-2G-Baijal/articleshow/47423096.cms anchor: Ex-PM told me to go along on 2G: Baijal
  outlink: toUrl: http://timesofindia.indiatimes.com/india/Bofors-scandal-was-a-media-trial-President-Pranab-Mukherjee/articleshow/47423220.cms anchor: Bofors scandal was a media trial: President Pranab
  outlink: toUrl: http://timesofindia.indiatimes.com/city/hyderabad/100-more-succumb-to-sunstroke-in-Andhra-Pradesh-and-Telangana-toll-nears-600/articleshow/47423351.cms anchor: Heat wave: Death toll in AP, Telangana nears 600
  outlink: toUrl: http://timesofindia.indiatimes.com/city/chennai/Filipino-woman-converts-in-Chennai-Hindu-outfit-calls-it-ghar-wapsi/articleshow/47422977.cms anchor: 'Ghar wapsi' mars Filipina conversion
  outlink: toUrl: http://timesofindia.indiatimes.com/city/kolkata/Bengal-minister-shoes-his-team-whos-the-boss/articleshow/47423596.cms anchor: Bengal minister 'shoes' his team who's the boss
  outlink: toUrl: http://timesofindia.indiatimes.com/india/Congress-blasts-NDA-calls-achhe-din-a-jumla/articleshow/47424084.cms anchor: Congress blasts NDA, calls achhe din a 'jumla'
  outlink: toUrl: http://timesofindia.indiatimes.com/world/us/No-time-to-eat-Silicon-Valley-drinks-its-meals/articleshow/47424226.cms anchor: No time to eat, Silicon Valley drinks its meals
  outlink: toUrl: http://timesofindia.indiatimes.com/city/kolkata/Lens-on-2-firms-for-buying-Didis-paintings/articleshow/47421881.cms anchor: Lens on 2 firms for buying Didi's paintings
  outlink: toUrl: http://timesofindia.indiatimes.com/india/Bad-days-are-here-for-power-brokers-PM-Narendra-Modi/articleshow/47424041.cms anchor: Bad days are here for power brokers: PM Modi
  outlink: toUrl: http://timesofindia.indiatimes.com/india/Gadget-greed-leads-Gujarat-girl-13-to-prostitution/articleshow/47423265.cms anchor: Gadget greed leads Gujarat girl, 13, to prostitution
  outlink: toUrl: http://timesofindia.indiatimes.com/city/mumbai/Government-scanner-on-implant-overcharging/articleshow/47423821.cms anchor: Government scanner on implant overcharging
  outlink: toUrl: http://timesofindia.indiatimes.com/india/On-anniversary-eve-RSS-VHP-bring-up-Ram-temple/articleshow/47422947.cms anchor: On anniversary eve, RSS, VHP bring up Ram temple
  outlink: toUrl: http://timesofindia.indiatimes.com/india/HC-restores-ACBs-power-to-act-against-any-govt-official/articleshow/47423083.cms anchor: ACB can act against govt officials: HC
  outlink: toUrl: http://timesofindia.indiatimes.com/home/education/news/Delhi-girl-tops-CBSE-exam-with-496/500-in-commerce/articleshow/47423362.cms anchor: M Gayatri, a Delhi girl, tops CBSE 12th exam
  outlink: toUrl: http://timesofindia.indiatimes.com/india/PM-doesnt-mention-one-rank-one-pension-leaves-ex-servicemen-disappointed/articleshow/47423981.cms anchor: Forces gutted by fresh letdown on pension parity
  outlink: toUrl: http://timesofindia.indiatimes.com/india/7-months-on-child-rights-panel-exists-only-on-paper/articleshow/47423484.cms anchor: 7 mths on, child rights panel exists only on paper
Content Metadata: Content-Language=en-US Age=384 Content-Length=3801 Expires=Tue, 26 May 2015 02:56:18 GMT Last-Modified=Tue, 26 May 2015 02:46:18 GMT Connection=keep-alive X-Cache-Lookup=HIT from opv.uci.cu:3128 Server=Apache/2.2.15 (CentOS) X-Cache=HIT from opv.uci.cu Vary=Accept-Encoding Date=Tue, 26 May 2015 02:49:25 GMT CacheControl=public nutch.crawl.score=0.0 Content-Encoding=gzip Via=1.0 opv.uci.cu (squid/3.1.10) Content-Type=text/xml;charset=UTF-8
Parse Metadata: description=Times of India brings the Latest & Top Breaking News on Politics and Current Affairs in India & around the World, Cricket, Sports, Business, Bollywood News and Entertainment, Science, Technology, Health & Fitness news & opinions from leading columnists. Content-Type=application/rss+xml dc:description=Times of India brings the Latest & Top Breaking News on Politics and Current Affairs in India & around the World, Cricket, Sports, Business, Bollywood News and Entertainment, Science, Technology, Health & Fitness news & opinions from leading columnists. dc:title=The Times of India

So the Tika parser is working as expected and identifies all the outlinks present in the RSS feed. This brings up some questions:

1. Which command are you using to execute Nutch?
2. What are you trying to do by configuring the regex URLFilter? Are you trying to restrict your crawl to only this site? If so, perhaps you should use the urlfilter-domain plugin instead: just activate urlfilter-domain and configure “timesofindia.indiatimes.com” (the host of the feed and of all its outlinks) in conf/domain-urlfilter.txt.

If my memory is not playing tricks on me, I think that in the regex URL filter plugin you need to escape special characters such as the dot, i.e. write \. instead of a bare . in your rules.

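To illustrate the escaping point, here is a minimal Python sketch of how a list of +/- regex rules could be evaluated against a URL. It is only an approximation: Nutch's urlfilter-regex uses Java regular expressions, not Python's re module, and the "first matching rule wins" behaviour modelled here is my understanding of the plugin, not something I have verified against the source. The rule list and the helper names (compile_rules, accepts) are hypothetical.

```python
import re

# Hypothetical rules in the style of conf/regex-urlfilter.txt:
# '+' accepts, '-' rejects; dots are escaped so they match a literal '.'.
RULES = [
    r"+^http://timesofindia\.indiatimes\.com/",
    r"-.",  # reject everything else
]

def compile_rules(rules):
    """Split each rule into (accept flag, compiled pattern)."""
    return [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]

def accepts(url, compiled):
    """Return the verdict of the first rule that matches; reject if none do.

    Assumption: first match wins, and patterns may match anywhere in the URL.
    """
    for accept, pattern in compiled:
        if pattern.search(url):
            return accept
    return False

compiled = compile_rules(RULES)

seed = "http://timesofindia.indiatimes.com/rssfeedsdefault.cms"
outlink = ("http://timesofindia.indiatimes.com/india/"
           "Congress-blasts-NDA-calls-achhe-din-a-jumla/articleshow/47424084.cms")
lookalike = "http://timesofindiaXindiatimes.com/"

print(accepts(seed, compiled))       # True
print(accepts(outlink, compiled))    # True
print(accepts(lookalike, compiled))  # False: the escaped dot does not match 'X'

# With an unescaped dot, the same lookalike host would slip through.
unescaped = re.compile(r"^http://timesofindia.indiatimes.com/")
print(unescaped.search(lookalike) is not None)  # True
```

Note that an unescaped dot only makes a rule more permissive, so under these assumptions the rules in your message would still accept the outlinks from the feed.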
Hope it helps,

[1] https://tika.apache.org/0.9/formats.html

> On May 25, 2015, at 10:15 PM, Ankit Goel <[email protected]> wrote:
>
> Hi,
> I'm using Nutch 1.9 with Solr 4.9.
> The plugin for crawling rss feeds is shipped with the binary as noted on
> the site, but I have found almost no clear literature on
> activating/including that plugin, or if it is already activated.
> Running a crawl with the seed as a rss site (
> http://timesofindia.indiatimes.com/rssfeedsdefault.cms), the seed is
> parsed, but none of the links are crawled and the process just ends. The
> regex urlfilter is set to permit all links from that site.
> +^http://timesofindia.com
> +^http://timesofindia.indiatimes.com
> +^http://timesofindia.indiatimes.com/rssfeedsdefault.cms
>
> But I am not sure if any changes need to be made to nutch-site and
> parse-plugin.xml. Current nutch-site has
>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> parse-plugin is unchanged- as shipped. Parts of it are as follows :
>
> <mimeType name="application/rss+xml">
> <plugin id="parse-tika" />
> <plugin id="feed" />
> </mimeType>
>
> <mimeType name="text/html">
> <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
> <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="text/xml">
> <plugin id="parse-tika" />
> <plugin id="feed" />
> </mimeType>
>
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel

