Re: [MASSMAIL]Nutch not crawling links inside RSS Feeds

Jorge Luis Betancourt Gonzalez Mon, 25 May 2015 20:08:02 -0700

I don’t think you’ll need to modify parse-plugins.xml, because Tika (the 
default parser) is capable of handling RSS feeds [1]. Also, using the default 
Nutch distribution without any changes, running the parse checker against the 
URL you provided gives me the following output:

$ bin/nutch parsechecker http://timesofindia.indiatimes.com/rssfeedsdefault.cms

fetching: http://timesofindia.indiatimes.com/rssfeedsdefault.cms
parsing: http://timesofindia.indiatimes.com/rssfeedsdefault.cms
contentType: application/rss+xml
signature: e277b1d141680fe4afdc68dfb591503b
---------
Url
---------------

http://timesofindia.indiatimes.com/rssfeedsdefault.cms
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: The Times of India
Outlinks: 18
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/TOI-rating-With-77-5-Modi-govt-gets-distinction-in-its-first-year/articleshow/47422949.cms
 anchor: TOI rating: With 77.5%, Modi govt gets distinction in its first year
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/Black-money-Switzerland-discloses-names-of-two-Indians/articleshow/47420853.cms
 anchor: Black money: Switzerland discloses 2 names
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/Ex-PM-Manmohan-Singh-told-me-to-go-along-on-2G-Baijal/articleshow/47423096.cms
 anchor: Ex-PM told me to go along on 2G: Baijal
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/Bofors-scandal-was-a-media-trial-President-Pranab-Mukherjee/articleshow/47423220.cms
 anchor: Bofors scandal was a media trial: President Pranab
  outlink: toUrl: 
http://timesofindia.indiatimes.com/city/hyderabad/100-more-succumb-to-sunstroke-in-Andhra-Pradesh-and-Telangana-toll-nears-600/articleshow/47423351.cms
 anchor: Heat wave: Death toll in AP, Telangana nears 600
  outlink: toUrl: 
http://timesofindia.indiatimes.com/city/chennai/Filipino-woman-converts-in-Chennai-Hindu-outfit-calls-it-ghar-wapsi/articleshow/47422977.cms
 anchor: 'Ghar wapsi' mars Filipina conversion
  outlink: toUrl: 
http://timesofindia.indiatimes.com/city/kolkata/Bengal-minister-shoes-his-team-whos-the-boss/articleshow/47423596.cms
 anchor: Bengal minister 'shoes' his team who's the boss
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/Congress-blasts-NDA-calls-achhe-din-a-jumla/articleshow/47424084.cms
 anchor: Congress blasts NDA, calls achhe din a 'jumla'
  outlink: toUrl: 
http://timesofindia.indiatimes.com/world/us/No-time-to-eat-Silicon-Valley-drinks-its-meals/articleshow/47424226.cms
 anchor: No time to eat, Silicon Valley drinks its meals
  outlink: toUrl: 
http://timesofindia.indiatimes.com/city/kolkata/Lens-on-2-firms-for-buying-Didis-paintings/articleshow/47421881.cms
 anchor: Lens on 2 firms for buying Didi's paintings
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/Bad-days-are-here-for-power-brokers-PM-Narendra-Modi/articleshow/47424041.cms
 anchor: Bad days are here for power brokers: PM Modi
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/Gadget-greed-leads-Gujarat-girl-13-to-prostitution/articleshow/47423265.cms
 anchor: Gadget greed leads Gujarat girl, 13, to prostitution
  outlink: toUrl: 
http://timesofindia.indiatimes.com/city/mumbai/Government-scanner-on-implant-overcharging/articleshow/47423821.cms
 anchor: Government scanner on implant overcharging
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/On-anniversary-eve-RSS-VHP-bring-up-Ram-temple/articleshow/47422947.cms
 anchor: On anniversary eve, RSS, VHP bring up Ram temple
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/HC-restores-ACBs-power-to-act-against-any-govt-official/articleshow/47423083.cms
 anchor: ACB can act against govt officials: HC
  outlink: toUrl: 
http://timesofindia.indiatimes.com/home/education/news/Delhi-girl-tops-CBSE-exam-with-496/500-in-commerce/articleshow/47423362.cms
 anchor: M Gayatri, a Delhi girl, tops CBSE 12th exam
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/PM-doesnt-mention-one-rank-one-pension-leaves-ex-servicemen-disappointed/articleshow/47423981.cms
 anchor: Forces gutted by fresh letdown on pension parity
  outlink: toUrl: 
http://timesofindia.indiatimes.com/india/7-months-on-child-rights-panel-exists-only-on-paper/articleshow/47423484.cms
 anchor: 7 mths on, child rights panel exists only on paper
Content Metadata: Content-Language=en-US Age=384 Content-Length=3801 
Expires=Tue, 26 May 2015 02:56:18 GMT Last-Modified=Tue, 26 May 2015 02:46:18 
GMT Connection=keep-alive X-Cache-Lookup=HIT from opv.uci.cu:3128 
Server=Apache/2.2.15 (CentOS) X-Cache=HIT from opv.uci.cu Vary=Accept-Encoding 
Date=Tue, 26 May 2015 02:49:25 GMT CacheControl=public nutch.crawl.score=0.0 
Content-Encoding=gzip Via=1.0 opv.uci.cu (squid/3.1.10) 
Content-Type=text/xml;charset=UTF-8
Parse Metadata: description=Times of India brings the Latest & Top Breaking 
News on Politics and Current Affairs in India & around the World, Cricket, 
Sports, Business, Bollywood News and Entertainment, Science, Technology, Health 
& Fitness news & opinions from leading columnists. 
Content-Type=application/rss+xml dc:description=Times of India brings the 
Latest & Top Breaking News on Politics and Current Affairs in India & around 
the World, Cricket, Sports, Business, Bollywood News and Entertainment, 
Science, Technology, Health & Fitness news & opinions from leading columnists. 
dc:title=The Times of India

So the Tika parser is working as expected and identifies all the outlinks 
present in the RSS feed. This brings up some questions:

1. Which command are you using to execute Nutch?
2. What are you trying to do by configuring the regex URLFilter? Are you trying 
to restrict your crawl to only this site? If so, perhaps you should use the 
urlfilter-domain plugin instead: just activate urlfilter-domain and add 
“timesofindia.indiatimes.com” (the host of your feed) to 
conf/domain-urlfilter.txt.
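
As a rough sketch of the urlfilter-domain suggestion (not tested against your 
setup), the change in conf/nutch-site.xml could look like the fragment below. 
It reuses the plugin.includes value from your message and only adds 
urlfilter-domain. The hostname in the comment is taken from the feed URL and 
is an assumption about what you want to crawl.

```xml
<!-- conf/nutch-site.xml: same plugin list as in the original message,
     with urlfilter-domain added next to urlfilter-regex -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- conf/domain-urlfilter.txt then lists one domain or host per line, e.g.:
     timesofindia.indiatimes.com -->
```

Note that with both filters active a URL has to pass each of them, so 
regex-urlfilter.txt must still accept the article URLs.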

If my memory is not playing tricks on me, in the regex URL filter plugin you 
need to escape special characters such as the dot: (\.) 
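
To illustrate why escaping matters, here is a small sketch in Python. Nutch's 
regex filter uses Java regular expressions, so this is only an illustration, 
but an unescaped dot means "any character" in both. It assumes the filter does 
a find-style match, and the lookalike URL is a made-up example.

```python
import re

# Rule as written in the original message (dots left unescaped).
LOOSE = re.compile(r"^http://timesofindia.indiatimes.com")
# Same rule with the literal dots escaped.
STRICT = re.compile(r"^http://timesofindia\.indiatimes\.com/")


def accepts(pattern, url):
    """Return True if the pattern is found in the URL (find-style match)."""
    return pattern.search(url) is not None


feed = "http://timesofindia.indiatimes.com/rssfeedsdefault.cms"
# Hypothetical host: an "X" sits where the first dot should be.
lookalike = "http://timesofindiaXindiatimes.com/"

for name, pattern in (("loose", LOOSE), ("strict", STRICT)):
    print(name, accepts(pattern, feed), accepts(pattern, lookalike))
```

Both rules accept the feed URL, so the unescaped dots alone should not block 
your outlinks. The loose rule simply accepts more hosts than intended.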

Hope it helps, 

[1] https://tika.apache.org/0.9/formats.html
> On May 25, 2015, at 10:15 PM, Ankit Goel <[email protected]> wrote:
> 
> Hi,
> I'm using Nutch 1.9 with Solr 4.9.
> The plugin for crawling rss feeds is shipped with the binary as noted on
> the site, but I have found almost no clear literature on
> activating/including that plugin, or if it is already activated.
> Running a crawl with the seed as a rss site (
> http://timesofindia.indiatimes.com/rssfeedsdefault.cms), the seed is
> parsed, but none of the links are crawled and the process just ends. The
> regex urlfilter is set to permit all links from that site.
> +^http://timesofindia.com
> +^http://timesofindia.indiatimes.com
> +^http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> 
> But I am not sure if any changes need to be made to nutch-site and
> parse-plugin.xml. Current nutch-site has
> 
> <name>plugin.includes</name>
> 
> <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> 
> parse-plugin is unchanged- as shipped. Parts of it are as follows :
> 
> <mimeType name="application/rss+xml">
>    <plugin id="parse-tika" />
>    <plugin id="feed" />
> </mimeType>
> 
> <mimeType name="text/html">
> <plugin id="parse-tika" />
> </mimeType>
> 
> <mimeType name="application/xhtml+xml">
> <plugin id="parse-tika" />
> </mimeType>
> 
> <mimeType name="text/xml">
> <plugin id="parse-tika" />
> <plugin id="feed" />
> </mimeType>
> 
> 
> -- 
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
