I have created NUTCH-1418 for the same.


________________________________
 From: Julien Nioche <[email protected]>
To: [email protected]; arijit <[email protected]> 
Sent: Monday, July 2, 2012 9:01 PM
Subject: Re: parsechecker fetches url but fetcher fails
 

Looks like a bug in the way the robots parser deals with URLs like this. 
Please open a JIRA.
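
For reference, a minimal sketch of why the paths in the quoted log might fail to decode. It assumes the robots parser decodes each path with java.net.URLDecoder, which the "can't decode path" warning suggests but which is not confirmed in this thread. The class name RobotsPathDecodeCheck and the helper tryDecode are hypothetical, and the "%3A" variant is only a guess at what the robots.txt entry was meant to be. The point is that "%3M" is not a valid percent-escape, because "M" is not a hexadecimal digit.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class RobotsPathDecodeCheck {

    // Hypothetical helper: returns the decoded path, or null when the
    // percent-escapes in the path are malformed.
    static String tryDecode(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // URLDecoder throws IllegalArgumentException for escapes such as
            // "%3M", where the two characters after '%' are not both hex digits.
            return null;
        }
    }

    public static void main(String[] args) {
        // Path exactly as it appears in the quoted hadoop.log warning.
        String fromLog = "/wiki/Wikipedia%3Mediation_Committee/";
        // Assumption: the entry was meant to contain "%3A" (an encoded ':').
        String assumedIntended = "/wiki/Wikipedia%3AMediation_Committee/";

        System.out.println(fromLog + " -> " + tryDecode(fromLog));
        System.out.println(assumedIntended + " -> " + tryDecode(assumedIntended));
    }
}
```

Under these assumptions the first path does not decode and the second decodes to /wiki/Wikipedia:Mediation_Committee/. This only illustrates the warning itself; it does not show why the fetcher then fetched nothing.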


On 2 July 2012 13:00, arijit <[email protected]> wrote:

>Hi,
>   Since learning that Nutch cannot crawl the JavaScript function 
>calls in href, I started looking for other alternatives. I decided to crawl 
>http://en.wikipedia.org/wiki/Districts_of_India.
>    I first tried injecting this URL and followed the step-by-step approach up 
>to the fetcher, at which point I realized that Nutch did not fetch anything 
>from this website. I looked into logs/hadoop.log and found the following 3 
>lines, which I believe could be saying that Nutch is unable to parse the 
>robots.txt of the website and that the fetcher therefore stopped?
>
>    
>
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>
>
>    I tried checking the URL using parsechecker and there were no issues there! 
>I think this means that the robots.txt is malformed for this website, which is 
>preventing the fetcher from fetching anything. Is there a way to get around 
>this problem, given that parsechecker seems to go on its merry way parsing?
>
>
>    Just so that my "novice" logic does not come in the way of finding what is 
>going wrong, I have attached my hadoop.log - which contains both the fetcher 
>as well as parsechecker logs.
>
>
>    Any help on this is much appreciated.
>-Arijit


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
