Hi,

Since learning that Nutch is unable to crawl JavaScript function calls in href attributes, I started looking for alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India. I first injected this URL and followed the step-by-step approach up to the fetcher, when I realized that Nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following three lines, which I believe indicate that Nutch is unable to parse the site's robots.txt and that the fetcher therefore stopped:
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

I checked the URL with parsechecker and found no issues there. I think this means the robots.txt for this site is malformed, which is preventing the fetcher from fetching anything. Is there a way to work around this problem, given that parsechecker goes on its merry way parsing? Just so that my "novice" logic does not get in the way of finding out what is going wrong, I have attached my hadoop.log, which contains both the fetcher and parsechecker logs. Any help on this is much appreciated.

-Arijit
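
P.S. In case it helps, here is a minimal sketch of why I suspect those paths trip the decoder. I am only guessing that RobotRulesParser runs each Disallow path through java.net.URLDecoder (I have not checked its source), but the snippet below reproduces the failure on the first path from the log: "%3M" is not a valid percent escape because "3M" is not a pair of hex digits.

  import java.io.UnsupportedEncodingException;
  import java.net.URLDecoder;

  public class RobotsPathDecodeTest {
      public static void main(String[] args) throws UnsupportedEncodingException {
          // Path copied from the hadoop.log warning above.
          String path = "/wiki/Wikipedia%3Mediation_Committee/";
          try {
              System.out.println(URLDecoder.decode(path, "UTF-8"));
          } catch (IllegalArgumentException e) {
              // URLDecoder rejects the malformed "%3M" escape, which would
              // explain the "can't decode path" warning if the robots parser
              // does something similar internally.
              System.out.println("decode failed: " + e.getMessage());
          }
      }
  }

Running it should print an "Illegal hex characters in escape (%) pattern" message rather than a decoded path, though I may be wrong about what the parser actually calls internally.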

