Re: [MASSMAIL]Nutch not crawling links inside RSS Feeds

Ankit Goel Tue, 02 Jun 2015 19:10:28 -0700

Hi Jorge,
Thanks for the advice. It turns out there was an error on the Solr schema
side. Since I had added the plugin for metatags and description, I was
getting duplicate entries for each, so the single-valued keyword and
description fields were receiving multiple values. While the duplication is
an error too (one I haven't found a solution for online), for now changing
those fields to multiValued has allowed Solr to index without problems.


Lastly, I'm not so clear about escaping special characters in the regex
urlfilter. The tutorial adds +^http://([a-z0-9]*\.)apache.nutch.org/
So inside the brackets the "." is escaped, but outside them, in the rest
of the URL, it is not.
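
To make the escaping question concrete, here is a minimal sketch in plain
Java. It does not use the Nutch plugin itself; it assumes the regex
urlfilter hands the part after the leading "+" to java.util.regex and
accepts a URL when the pattern is found anywhere in it. The class name,
the helper method and the test URLs are made up for illustration.

```java
import java.util.regex.Pattern;

public class UrlFilterEscapeDemo {
    // Pattern exactly as quoted above, minus the leading '+' (which only marks an accept rule).
    static final Pattern UNESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)apache.nutch.org/");

    // Same pattern with every literal dot escaped.
    static final Pattern ESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)apache\\.nutch\\.org/");

    // Assumption: a rule applies when the pattern is found in the URL (find(), not matches()).
    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String intended = "http://www.apache.nutch.org/docs/";
        // Hypothetical host where other characters sit in place of the dots.
        String lookalike = "http://www.apacheXnutchYorg/docs/";

        System.out.println("unescaped, intended:  " + accepts(UNESCAPED, intended));
        System.out.println("unescaped, lookalike: " + accepts(UNESCAPED, lookalike));
        System.out.println("escaped, intended:    " + accepts(ESCAPED, intended));
        System.out.println("escaped, lookalike:   " + accepts(ESCAPED, lookalike));
    }
}
```

An unescaped "." matches any single character, so the unescaped rule still
accepts the intended host but is looser than it looks. Escaping the dots
restricts the rule to literal dots.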

Also, and I know I might be stretching this by putting it in the same
post, have you experimented with boilerpipe and Nutch 1.9? I have activated
it, at least I think I have, but it doesn't work. Running boilerpipe
separately in a Java program gives fine enough results, though it takes the
link as input.

On Tue, May 26, 2015 at 8:35 AM, Jorge Luis Betancourt Gonzalez <
[email protected]> wrote:

> I don’t think you’ll need to modify the parse-plugin.xml because Tika (the
> default parser) is capable of handling RSS feeds [1]. Second, using the
> default Nutch distribution without any change, executing a parse checker
> against the URL you provided, gives me the following output:
>
> $ bin/nutch parsechecker http://timesofindia.indiatimes.com/rssfeedsdefault.cms
>
> fetching: http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> parsing: http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> contentType: application/rss+xml
> signature: e277b1d141680fe4afdc68dfb591503b
> ---------
> Url
> ---------------
>
> http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: The Times of India
> Outlinks: 18
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/TOI-rating-With-77-5-Modi-govt-gets-distinction-in-its-first-year/articleshow/47422949.cms
> anchor: TOI rating: With 77.5%, Modi govt gets distinction in its first year
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/Black-money-Switzerland-discloses-names-of-two-Indians/articleshow/47420853.cms
> anchor: Black money: Switzerland discloses 2 names
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/Ex-PM-Manmohan-Singh-told-me-to-go-along-on-2G-Baijal/articleshow/47423096.cms
> anchor: Ex-PM told me to go along on 2G: Baijal
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/Bofors-scandal-was-a-media-trial-President-Pranab-Mukherjee/articleshow/47423220.cms
> anchor: Bofors scandal was a media trial: President Pranab
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/city/hyderabad/100-more-succumb-to-sunstroke-in-Andhra-Pradesh-and-Telangana-toll-nears-600/articleshow/47423351.cms
> anchor: Heat wave: Death toll in AP, Telangana nears 600
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/city/chennai/Filipino-woman-converts-in-Chennai-Hindu-outfit-calls-it-ghar-wapsi/articleshow/47422977.cms
> anchor: 'Ghar wapsi' mars Filipina conversion
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/city/kolkata/Bengal-minister-shoes-his-team-whos-the-boss/articleshow/47423596.cms
> anchor: Bengal minister 'shoes' his team who's the boss
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/Congress-blasts-NDA-calls-achhe-din-a-jumla/articleshow/47424084.cms
> anchor: Congress blasts NDA, calls achhe din a 'jumla'
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/world/us/No-time-to-eat-Silicon-Valley-drinks-its-meals/articleshow/47424226.cms
> anchor: No time to eat, Silicon Valley drinks its meals
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/city/kolkata/Lens-on-2-firms-for-buying-Didis-paintings/articleshow/47421881.cms
> anchor: Lens on 2 firms for buying Didi's paintings
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/Bad-days-are-here-for-power-brokers-PM-Narendra-Modi/articleshow/47424041.cms
> anchor: Bad days are here for power brokers: PM Modi
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/Gadget-greed-leads-Gujarat-girl-13-to-prostitution/articleshow/47423265.cms
> anchor: Gadget greed leads Gujarat girl, 13, to prostitution
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/city/mumbai/Government-scanner-on-implant-overcharging/articleshow/47423821.cms
> anchor: Government scanner on implant overcharging
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/On-anniversary-eve-RSS-VHP-bring-up-Ram-temple/articleshow/47422947.cms
> anchor: On anniversary eve, RSS, VHP bring up Ram temple
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/HC-restores-ACBs-power-to-act-against-any-govt-official/articleshow/47423083.cms
> anchor: ACB can act against govt officials: HC
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/home/education/news/Delhi-girl-tops-CBSE-exam-with-496/500-in-commerce/articleshow/47423362.cms
> anchor: M Gayatri, a Delhi girl, tops CBSE 12th exam
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/PM-doesnt-mention-one-rank-one-pension-leaves-ex-servicemen-disappointed/articleshow/47423981.cms
> anchor: Forces gutted by fresh letdown on pension parity
>   outlink: toUrl:
> http://timesofindia.indiatimes.com/india/7-months-on-child-rights-panel-exists-only-on-paper/articleshow/47423484.cms
> anchor: 7 mths on, child rights panel exists only on paper
> Content Metadata: Content-Language=en-US Age=384 Content-Length=3801
> Expires=Tue, 26 May 2015 02:56:18 GMT Last-Modified=Tue, 26 May 2015
> 02:46:18 GMT Connection=keep-alive X-Cache-Lookup=HIT from opv.uci.cu:3128
> Server=Apache/2.2.15 (CentOS) X-Cache=HIT from opv.uci.cu
> Vary=Accept-Encoding Date=Tue, 26 May 2015 02:49:25 GMT CacheControl=public
> nutch.crawl.score=0.0 Content-Encoding=gzip Via=1.0 opv.uci.cu
> (squid/3.1.10) Content-Type=text/xml;charset=UTF-8
> Parse Metadata: description=Times of India brings the Latest & Top
> Breaking News on Politics and Current Affairs in India & around the World,
> Cricket, Sports, Business, Bollywood News and Entertainment, Science,
> Technology, Health & Fitness news & opinions from leading columnists.
> Content-Type=application/rss+xml dc:description=Times of India brings the
> Latest & Top Breaking News on Politics and Current Affairs in India &
> around the World, Cricket, Sports, Business, Bollywood News and
> Entertainment, Science, Technology, Health & Fitness news & opinions from
> leading columnists. dc:title=The Times of India
>
> So the Tika parser is working as expected and identifying all the outlinks
> present in the RSS feed, so this brings up some questions:
>
> 1. Which command are you using to execute Nutch?
> 2. What are you trying to do by configuring the regex URLFilter? Are you
> trying to restrict your crawl to only this site? If this is the case,
> perhaps you should use the urlfilter-domain plugin, for instance just
> activate the urlfilter-domain and configure “timesofindia.com” in the
> conf/domain-urlfilter.txt
>
> If my memory is not playing tricks on me I think that in the regex URL
> filter plugin you need to escape special characters as: (.)
>
> Hope it helps,
>
> [1] https://tika.apache.org/0.9/formats.html
> > On May 25, 2015, at 10:15 PM, Ankit Goel <[email protected]>
> wrote:
> >
> > Hi,
> > I'm using Nutch 1.9 with Solr 4.9.
> > The plugin for crawling rss feeds is shipped with the binary as noted on
> > the site, but I have found almost no clear literature on
> > activating/including that plugin, or if it is already activated.
> > Running a crawl with the seed as a rss site (
> > http://timesofindia.indiatimes.com/rssfeedsdefault.cms), the seed is
> > parsed, but none of the links are crawled and the process just ends. The
> > regex urlfilter is set to permit all links from that site.
> > +^http://timesofindia.com
> > +^http://timesofindia.indiatimes.com
> > +^http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> >
> > But I am not sure if any changes need to be made to nutch-site and
> > parse-plugin.xml. Current nutch-site has
> >
> > <name>plugin.includes</name>
> > <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >
> > parse-plugin is unchanged, as shipped. Parts of it are as follows:
> >
> > <mimeType name="application/rss+xml">
> >    <plugin id="parse-tika" />
> >    <plugin id="feed" />
> > </mimeType>
> >
> > <mimeType name="text/html">
> > <plugin id="parse-tika" />
> > </mimeType>
> >
> > <mimeType name="application/xhtml+xml">
> > <plugin id="parse-tika" />
> > </mimeType>
> >
> > <mimeType name="text/xml">
> > <plugin id="parse-tika" />
> > <plugin id="feed" />
> > </mimeType>
> >
> >
> > --
> > Regards,
> > Ankit Goel
> > http://about.me/ankitgoel
>
>
>


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
