Alexander,
   As per the following JIRA: https://issues.apache.org/jira/browse/NUTCH-938, 
they have taken away the ability to ignore robots.txt using 
"protocol.plugin.check.robots". :(
   I tried setting both protocol.plugin.check.robots and
protocol.plugin.check.blocking to false, but ended up with the same issue.
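   For reference, this is roughly what I have in conf/nutch-site.xml (just the
two properties above set to false; after NUTCH-938 the robots check apparently
ignores them regardless):

<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
<property>
  <name>protocol.plugin.check.blocking</name>
  <value>false</value>
</property>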

-Arijit



________________________________
 From: Alexander Aristov <[email protected]>
To: [email protected] 
Cc: arijit <[email protected]> 
Sent: Monday, July 2, 2012 9:17 PM
Subject: Re: parsechecker fetches url but fetcher fails
 

I remember there was a nutch property that controls how it behaves on robots.txt.

It can respect OR completely bypass robots.txt. Check nutch-site and 
nutch-default files.

Best Regards
Alexander Aristov



On 2 July 2012 19:31, Julien Nioche <[email protected]> wrote:

>Looks like a bug with the way the robots parser deals with URLs like this.
>Please open a JIRA
>
>
>On 2 July 2012 13:00, arijit <[email protected]> wrote:
>
>> Hi,
>>    Since learning that nutch will be unable to crawl the javascript
>> function calls in href attributes, I started looking for other alternatives. I
>> decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>>     I first tried injecting this URL and following the step-by-step approach
>> up to the fetcher, when I realized that nutch did not fetch anything from this
>> website. I looked into logs/hadoop.log and found the following 3
>> lines - which I believe could be saying that nutch is unable to parse the
>> robots.txt on the website and therefore the fetcher stopped?
>>
>>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing
>> robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing
>> robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing
>> robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>>
>>     I tried checking the URL using parsechecker and saw no issues there! I
>> think this means that the robots.txt is malformed for this website, which is
>> preventing the fetcher from fetching anything. Is there a way to get around
>> this problem, given that parsechecker seems to go on its merry way parsing?
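>>
>>     For what it's worth, here is a tiny standalone sketch (my own guess at
>> what a strict percent-decoder does with those paths - I have not checked
>> Nutch's actual RobotRulesParser code) that reproduces the same failure on
>> one of them:
>>
>>     import java.net.URLDecoder;
>>
>>     public class RobotsPathDecodeTest {
>>         public static void main(String[] args) throws Exception {
>>             // "%3M" is not a valid percent escape ("%3A" would decode to ':'),
>>             // so a strict decoder rejects the whole path.
>>             String path = "/wiki/Wikipedia%3Mediation_Committee/";
>>             try {
>>                 System.out.println(URLDecoder.decode(path, "UTF-8"));
>>             } catch (IllegalArgumentException e) {
>>                 System.out.println("can't decode path: " + path
>>                         + " (" + e.getMessage() + ")");
>>             }
>>         }
>>     }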
>>
>>     Just so that my "novice" logic does not get in the way of finding out
>> what is going wrong, I have attached my hadoop.log, which contains both
>> the fetcher and the parsechecker logs.
>>
>>     Any help on this is much appreciated.
>> -Arijit
>>
>
>
>
>--
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
>
