Hi,
   Since learning that Nutch cannot follow JavaScript function calls in href
attributes, I started looking for other alternatives. I decided to crawl
http://en.wikipedia.org/wiki/Districts_of_India.
    I first tried injecting this URL and following the step-by-step approach up
to the fetch step, when I realized that Nutch did not fetch anything from this
website. (The commands I ran are sketched after the log excerpt below.) Looking
into logs/hadoop.log, I found the following 3 lines, which I believe could mean
that Nutch is unable to parse the site's robots.txt and that the fetcher
therefore stopped?

    

    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
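
    For reference, this is roughly the sequence of commands I ran up to the
fetch step; the crawldb, segments and urls paths below are just from my local
setup, so treat them as placeholders:

        # inject the seed URL, generate a fetch list, then fetch it
        bin/nutch inject crawl/crawldb urls
        bin/nutch generate crawl/crawldb crawl/segments
        segment=`ls -d crawl/segments/* | tail -1`   # newest segment
        bin/nutch fetch $segment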

    I tried checking the URL using parsechecker and found no issues there. I
think this means that the robots.txt for this website is malformed, which is
preventing the fetcher from fetching anything. Is there a way to work around
this problem, given that parsechecker goes on its merry way parsing the page?
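
    (The parsechecker run that works was simply along the lines of:

        bin/nutch parsechecker http://en.wikipedia.org/wiki/Districts_of_India

so the page itself parses fine; it seems to be only the robots.txt handling
that trips up the fetcher.)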

    Just so that my "novice" logic does not get in the way of finding what is
going wrong, I have attached my hadoop.log, which contains both the fetcher and
parsechecker logs.

    Any help on this is much appreciated.
-Arijit
