Alexander,

As per the following JIRA: https://issues.apache.org/jira/browse/NUTCH-938, they have taken away the ability to ignore robots.txt using "protocol.plugin.check.robots". :( I tried setting protocol.plugin.check.robots and protocol.plugin.check.blocking to false, but ended up with the same issue.
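For reference, this is roughly what I had in nutch-site.xml (reconstructed from memory, so treat it as a sketch rather than an exact copy of my file):

<!-- Overrides I tried; per NUTCH-938 the robots.txt check can apparently
     no longer be disabled this way, which is what I ran into. -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
<property>
  <name>protocol.plugin.check.blocking</name>
  <value>false</value>
</property>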
-Arijit

________________________________
From: Alexander Aristov <[email protected]>
To: [email protected]
Cc: arijit <[email protected]>
Sent: Monday, July 2, 2012 9:17 PM
Subject: Re: parsechecker fetches url but fetcher fails

I remember there was a Nutch property controlling how it behaves on robots.txt. It can respect OR completely bypass robots.txt. Check the nutch-site and nutch-default files.

Best Regards
Alexander Aristov

On 2 July 2012 19:31, Julien Nioche <[email protected]> wrote:

> Looks like a bug with the way the robots parser deals with URLs like this.
> Please open a JIRA
>
>
> On 2 July 2012 13:00, arijit <[email protected]> wrote:
>
>> Hi,
>> Since learning that Nutch will be unable to crawl the javascript
>> function calls in href, I started looking for other alternatives. I decided
>> to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>> I first tried injecting this URL and following the step-by-step approach
>> up to the fetcher, when I realized Nutch did not fetch anything from this
>> website. I looked into logs/hadoop.log and found the following 3
>> lines, which I believe could be saying that Nutch is unable to parse the
>> robots.txt on the website and therefore the fetcher stopped?
>>
>> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing
>> robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing
>> robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing
>> robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>>
>> I tried checking the URL using parsechecker and found no issues there! I
>> think this means that the robots.txt is malformed for this website, which is
>> preventing the fetcher from fetching anything. Is there a way to get around
>> this problem, given that parsechecker seems to go on its merry way parsing?
>>
>> Just so that my "novice" logic does not get in the way of finding
>> what is going wrong, I have attached my hadoop.log, which contains both
>> the fetcher and the parsechecker logs.
>>
>> Any help on this is much appreciated.
>> -Arijit
>>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
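PS: I suspect the warnings above come from the percent-escapes in those robots.txt paths: "%3M" is not a valid escape, since "%" must be followed by two hex digits. Assuming the robots parser decodes each path roughly the way java.net.URLDecoder does (an assumption on my part, I have not checked the Nutch source), this standalone snippet reproduces the same kind of decode failure:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Path copied from the hadoop.log warning above; "%3M" is not valid
        // percent-encoding because 'M' is not a hex digit.
        String path = "/wiki/Wikipedia%3Mediation_Committee/";
        try {
            System.out.println(URLDecoder.decode(path, "UTF-8"));
        } catch (IllegalArgumentException e) {
            // This is the sort of error I assume surfaces in hadoop.log as
            // "error parsing robots rules- can't decode path: ..."
            System.out.println("decode failed: " + e.getMessage());
        }
    }
}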

