Ken,

Julien seemed to be the only person actively working on CC, and his last post on that JIRA seemed to say that we should stick to using the nutch parser for robots.txt and that crawler-commons is not very active. I am a complete beginner with nutch, so I'm not sure which way to go. I will give CC a spin and see if I can get around this robots.txt problem, and also open a JIRA so that this issue gets resolved inside nutch. Will report back if I meet with any (hard) luck! :)
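In case it is useful, this is roughly what I plan to try first - a minimal, self-contained sketch assuming crawler-commons' SimpleRobotRulesParser and its parseContent(url, content, contentType, robotNames) method (I haven't compiled this against the library yet, so treat the details as my assumption):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        // A snippet mimicking the problematic Wikipedia robots.txt entries
        // (note the malformed %3 escapes that trip up Nutch's parser).
        String robotsTxt =
              "User-agent: *\n"
            + "Disallow: /wiki/Wikipedia%3Mediation_Committee/\n"
            + "Disallow: /wiki/Wikipedia_talk%3Mediation_Committee/\n"
            + "Disallow: /wiki/Wikipedia%3Mediation_Cabal/Cases/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
            "http://en.wikipedia.org/robots.txt", // URL the rules came from
            robotsTxt.getBytes("UTF-8"),          // raw robots.txt content
            "text/plain",                         // content type
            "my-nutch-crawler");                  // robot name to match (placeholder)

        // The page I actually want should still be allowed, malformed
        // Disallow lines or not.
        System.out.println(rules.isAllowed(
            "http://en.wikipedia.org/wiki/Districts_of_India"));
    }
}

If that prints true even with the broken Disallow lines, then wiring crawler-commons into the fetcher (as per NUTCH-1031) should get me past this.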
-Arijit

________________________________
From: Ken Krugler <[email protected]>
To: [email protected]
Sent: Monday, July 2, 2012 10:56 PM
Subject: Re: parsechecker fetches url but fetcher fails

On Jul 2, 2012, at 5:00am, arijit wrote:

> Hi,
>
> Since learning that nutch will be unable to crawl the javascript function
> calls in href, I started looking for other alternatives. I decided to crawl
> http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and followed the step-by-step approach till
> fetcher - when I realized nutch did not fetch anything from this website. I
> tried looking into logs/hadoop.log and found the following 3 lines - which I
> believe could be saying that nutch is unable to parse the robots.txt on the
> website and therefore, fetcher stopped?
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

The issue is that the Wikipedia robots.txt file contains malformed URLs - these three are missing the 'A' from the %3A sequence.

> I tried checking the URL using parsechecker and no issues there! I think it
> means that the robots.txt is malformed for this website, which is preventing
> fetcher from fetching anything. Is there a way to get around this problem,
> as parsechecker seems to go on its merry way parsing.

This is an example of where having Nutch use the crawler-commons robots.txt parser would help :)

https://issues.apache.org/jira/browse/NUTCH-1031

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
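(A side note on those WARN lines: they make sense if Nutch's RobotRulesParser decodes the Disallow paths with something along the lines of java.net.URLDecoder, which rejects a '%' that is not followed by two hex digits. The link to RobotRulesParser's internals is my assumption; the decoder behaviour below is just standard Java:)

import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // Well-formed escape: %3A decodes to ':'.
        System.out.println(URLDecoder.decode(
            "/wiki/Wikipedia%3AMediation_Committee/", "UTF-8"));

        // Malformed escape: %3M is not a valid hex pair, so decoding fails
        // with IllegalArgumentException ("Illegal hex characters in escape
        // (%) pattern"), presumably what surfaces as the "can't decode
        // path" warnings in hadoop.log.
        try {
            URLDecoder.decode("/wiki/Wikipedia%3Mediation_Committee/", "UTF-8");
        } catch (IllegalArgumentException e) {
            System.out.println("decode failed: " + e.getMessage());
        }
    }
}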

