I remember there is a Nutch property that controls how it behaves towards robots.txt.

It can respect OR completely bypass robots.txt. Check the nutch-site.xml and
nutch-default.xml files.
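
For example, overrides go into conf/nutch-site.xml, which takes
precedence over conf/nutch-default.xml. A minimal sketch with two
robots-related properties (check the exact names against the
nutch-default.xml shipped with your version, and note that "mycrawler"
is just a placeholder agent name):

  <?xml version="1.0"?>
  <configuration>
    <!-- Agent names matched against the User-agent lines of a
         robots.txt file, in decreasing order of precedence. -->
    <property>
      <name>http.robots.agents</name>
      <value>mycrawler,*</value>
    </property>
    <!-- Treat a 403 when fetching robots.txt itself as "no
         restrictions" rather than "deny everything". -->
    <property>
      <name>http.robots.403.allow</name>
      <value>true</value>
    </property>
  </configuration>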

Best Regards
Alexander Aristov


On 2 July 2012 19:31, Julien Nioche <[email protected]> wrote:

> Looks like a bug in the way the robots parser deals with URLs like this.
> Please open a JIRA
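>
> For context: "%3A" is the valid escape for ':', while "%3M" has no
> valid hex pair after the '%', so a strict decoder rejects the whole
> path. A minimal illustration with plain java.net.URLDecoder (this is
> not Nutch's actual parser code, and the lenient fallback is just one
> possible behaviour):
>
>   import java.io.UnsupportedEncodingException;
>   import java.net.URLDecoder;
>
>   public class RobotsPathDecode {
>     // Decode a robots.txt path, keeping the raw string when it
>     // contains an invalid %-escape such as "%3M".
>     static String decodeLeniently(String path) {
>       try {
>         return URLDecoder.decode(path, "UTF-8");
>       } catch (UnsupportedEncodingException e) {
>         throw new RuntimeException(e); // UTF-8 always exists
>       } catch (IllegalArgumentException e) {
>         return path; // invalid escape: keep the path undecoded
>       }
>     }
>
>     public static void main(String[] args) {
>       // Without the catch this throws IllegalArgumentException,
>       // because 'M' is not a hex digit.
>       System.out.println(
>           decodeLeniently("/wiki/Wikipedia%3Mediation_Committee/"));
>     }
>   }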
>
> On 2 July 2012 13:00, arijit <[email protected]> wrote:
>
> > Hi,
> >    Since learning that Nutch is unable to crawl JavaScript function
> > calls inside href attributes, I started looking for other alternatives.
> > I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
> >     I first tried injecting this URL and following the step-by-step
> > approach up to the fetch step, when I realized that Nutch did not fetch
> > anything from this website. I looked into logs/hadoop.log and found the
> > following 3 lines, which I believe say that Nutch is unable to parse
> > the website's robots.txt and that the fetcher therefore stopped?
> >
> >     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing
> > robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> >     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing
> > robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> >     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing
> > robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
> >
> >     I tried checking the URL with parsechecker and found no issues
> > there! I think this means that this website's robots.txt is malformed,
> > which is preventing the fetcher from fetching anything. Is there a way
> > to get around this problem, given that parsechecker goes on its merry
> > way parsing?
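> >
> >     As a quick diagnostic (plain Java, not Nutch's parser; the class
> > name and output format are made up for illustration), this sketch
> > lists which Allow/Disallow paths in a robots.txt fail strict
> > percent-decoding, the same class of failure as the hadoop.log
> > warnings above:
> >
> >   import java.io.BufferedReader;
> >   import java.io.InputStreamReader;
> >   import java.net.URL;
> >   import java.net.URLDecoder;
> >
> >   public class RobotsTxtCheck {
> >     public static void main(String[] args) throws Exception {
> >       URL robots = new URL("http://en.wikipedia.org/robots.txt");
> >       BufferedReader in = new BufferedReader(
> >           new InputStreamReader(robots.openStream(), "UTF-8"));
> >       String line;
> >       while ((line = in.readLine()) != null) {
> >         // Only rule lines carry paths worth checking.
> >         if (!line.startsWith("Disallow:") && !line.startsWith("Allow:"))
> >           continue;
> >         String path = line.substring(line.indexOf(':') + 1).trim();
> >         try {
> >           URLDecoder.decode(path, "UTF-8");
> >         } catch (IllegalArgumentException e) {
> >           System.out.println("can't decode path: " + path);
> >         }
> >       }
> >       in.close();
> >     }
> >   }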
> >
> >     Just so that my "novice" reasoning does not get in the way of
> > finding out what is going wrong, I have attached my hadoop.log, which
> > contains both the fetcher and the parsechecker logs.
> >
> >     Any help on this, is much appreciated.
> > -Arijit
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
