I have created NUTCH-1418 for the same.
________________________________
From: Julien Nioche <[email protected]>
To: [email protected]; arijit <[email protected]>
Sent: Monday, July 2, 2012 9:01 PM
Subject: Re: parsechecker fetches url but fetcher fails

Looks like a bug in the way the robots parser deals with URLs like this. Please open a JIRA.

On 2 July 2012 13:00, arijit <[email protected]> wrote:

> Hi,
>
> Since learning that Nutch is unable to crawl JavaScript function calls
> in href, I started looking for other alternatives. I decided to crawl
> http://en.wikipedia.org/wiki/Districts_of_India.
>
> I first tried injecting this URL and following the step-by-step approach
> up to the fetcher, at which point I realized that Nutch did not fetch
> anything from this website. I looked into logs/hadoop.log and found the
> following 3 lines, which I believe could be saying that Nutch is unable
> to parse the robots.txt of the website and that the fetcher therefore
> stopped?
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>
> I tried checking the URL using parsechecker and there were no issues
> there! I think this means that the robots.txt for this website is
> malformed, which is preventing the fetcher from fetching anything. Is
> there a way to get around this problem, given that parsechecker seems to
> go on its merry way parsing?
>
> Just so that my "novice" logic does not get in the way of finding what
> is going wrong, I have attached my hadoop.log, which contains both the
> fetcher and the parsechecker logs.
>
> Any help on this is much appreciated.
> -Arijit

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
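
All three WARN lines quoted above contain the sequence %3M. That is not a valid percent-escape, because % must be followed by two hex digits and M is not one. Below is a minimal Java sketch of that decoding failure. It assumes the robots parser decodes paths with something along the lines of java.net.URLDecoder; the thread does not show the Nutch source, so that is an assumption. The class name RobotsPathDecodeSketch and the helper tryDecode are hypothetical, and the %3A path is only a made-up well-formed variant for comparison, not a claim about what the robots.txt was meant to say.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration only; this is not Nutch's RobotRulesParser.
class RobotsPathDecodeSketch {

    // Returns the decoded path, or null when a percent-escape is malformed.
    static String tryDecode(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException e) {
            // URLDecoder rejects escapes such as "%3M": "3M" is not valid hex.
            return null;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Path taken verbatim from the hadoop.log lines in this thread.
        String fromLog = "/wiki/Wikipedia%3Mediation_Committee/";
        // Made-up well-formed variant (assumption, for comparison only).
        String wellFormed = "/wiki/Wikipedia%3AMediation_Committee/";

        System.out.println(fromLog + " decodable: " + (tryDecode(fromLog) != null));
        System.out.println(wellFormed + " decodable: " + (tryDecode(wellFormed) != null));
    }
}
```

A parser that hits this case could reasonably log the problem and skip only the offending rule. Whether the fetcher in this thread stopped for that reason is a question for the JIRA; the sketch shows only why the path itself cannot be decoded.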

