I remember there is a Nutch property that controls how it behaves with respect to robots.txt: it can either respect it or bypass it completely. Check the nutch-site.xml and nutch-default.xml files.
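A minimal nutch-site.xml sketch of what that override could look like. The property name here is an assumption: older Nutch releases used `protocol.plugin.check.robots` to toggle robots.txt checking, but the exact name depends on your version, so verify it against the nutch-default.xml shipped with your release.

```xml
<!-- nutch-site.xml: a sketch, assuming the property name
     protocol.plugin.check.robots from older Nutch releases;
     check your version's nutch-default.xml for the real name. -->
<configuration>
  <property>
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
    <description>If false, the fetcher skips robots.txt checks.
    Bypassing robots.txt is impolite; use it only on sites you
    control or have permission to crawl.</description>
  </property>
</configuration>
```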
Best Regards,
Alexander Aristov

On 2 July 2012 19:31, Julien Nioche <[email protected]> wrote:

> Looks like a bug with the way the robots parser deals with URLs like this.
> Please open a JIRA.
>
> On 2 July 2012 13:00, arijit <[email protected]> wrote:
>
> > Hi,
> > Since learning that nutch will be unable to crawl the javascript
> > function calls in href, I started looking for other alternatives. I
> > decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
> > I first tried injecting this URL and following the step-by-step approach
> > up to the fetcher, when I realized nutch did not fetch anything from this
> > website. I looked into logs/hadoop.log and found the following three
> > lines, which I believe say that nutch is unable to parse the robots.txt
> > on the website and that, therefore, the fetcher stopped:
> >
> > 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing
> > robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> > 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing
> > robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> > 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing
> > robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
> >
> > I tried checking the URL using parsechecker and found no issues there! I
> > think this means the robots.txt is malformed for this website, which is
> > preventing the fetcher from fetching anything. Is there a way to get
> > around this problem, as parsechecker seems to go on its merry way
> > parsing?
> >
> > Just so that my "novice" logic does not get in the way of finding out
> > what is going wrong, I have attached my hadoop.log, which contains both
> > the fetcher and the parsechecker logs.
> >
> > Any help on this is much appreciated.
> > -Arijit
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
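For what it's worth, the warnings above are consistent with a percent-decoding failure: in `%3M`, the `M` is not a hex digit, so `%3M` is not a valid percent-escape (the path was presumably meant to contain `%3A`, which decodes to `:`). A small standalone sketch, using the JDK's `URLDecoder` rather than Nutch's own parser, shows the same failure mode:

```java
import java.net.URLDecoder;
import java.io.UnsupportedEncodingException;

public class DecodeCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The correctly-encoded form decodes fine: %3A is the escape for ':'.
        String ok = URLDecoder.decode("/wiki/Wikipedia%3AMediation_Committee/", "UTF-8");
        System.out.println(ok); // prints /wiki/Wikipedia:Mediation_Committee/

        // The form from the robots.txt fails: '3' is a hex digit but 'M' is
        // not, so %3M is an illegal escape and decoding throws.
        try {
            URLDecoder.decode("/wiki/Wikipedia%3Mediation_Committee/", "UTF-8");
        } catch (IllegalArgumentException e) {
            System.out.println("can't decode path"); // mirrors the Nutch warning
        }
    }
}
```

Whether the fetcher should give up entirely on a partially malformed robots.txt, rather than skipping the bad rule, is exactly the kind of question a JIRA issue would settle.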

