On Jul 2, 2012, at 5:00am, arijit wrote:

> Hi,
> Since learning that nutch will be unable to crawl the javascript function
> calls in href, I started looking for other alternatives. I decided to crawl
> http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and followed the step-by-step approach
> till fetcher - when I realized nutch did not fetch anything from this
> website. I looked into logs/hadoop.log and found the following 3 lines
> - which I believe could be saying that nutch is unable to parse the
> robots.txt on the website and therefore the fetcher stopped?
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
The issue is that the Wikipedia robots.txt file contains malformed URLs - all three of these paths are missing the 'A' from the %3A sequence.

> I tried checking the URL using parsechecker and no issues there! I think
> it means that the robots.txt is malformed for this website, which is
> preventing fetcher from fetching anything. Is there a way to get around this
> problem, as parsechecker seems to go on its merry way parsing.

This is an example of where having Nutch use the crawler-commons robots.txt parser would help :)

https://issues.apache.org/jira/browse/NUTCH-1031

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
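For anyone wondering why those particular paths trip the parser: in "%3M" the '%' is followed by only one valid hex digit, so it is not a legal percent-escape, and a strict decoder rejects the whole path - which is presumably what the "can't decode path" warning is reporting. Below is a minimal, standalone Java sketch (not Nutch's actual RobotRulesParser code) that reproduces the failure with java.net.URLDecoder on the paths from the log:

import java.net.URLDecoder;

public class RobotsPathDecodeCheck {

    public static void main(String[] args) throws Exception {
        String[] paths = {
            // Paths copied from the warnings above; '%3' is followed by 'M',
            // which is not a hex digit, so the escape sequence is invalid.
            "/wiki/Wikipedia%3Mediation_Committee/",
            "/wiki/Wikipedia_talk%3Mediation_Committee/",
            "/wiki/Wikipedia%3Mediation_Cabal/Cases/",
            // What the robots.txt presumably meant ('%3A' decodes to ':'):
            "/wiki/Wikipedia%3AMediation_Committee/"
        };

        for (String path : paths) {
            try {
                System.out.println(path + " -> " + URLDecoder.decode(path, "UTF-8"));
            } catch (IllegalArgumentException e) {
                // The malformed escape makes the decoder throw, the same
                // failure mode the RobotRulesParser warning describes.
                System.out.println(path + " -> can't decode path: " + e.getMessage());
            }
        }
    }
}

Running this, the first three paths throw IllegalArgumentException ("Illegal hex characters in escape (%) pattern") while the corrected %3A variant decodes cleanly, so one workaround short of switching parsers is to tolerate or skip individual undecodable rules rather than aborting on them.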

