If you're sure Nutch treats an empty Disallow value the same as "Disallow: /",
then please file an issue in Jira so we can track and fix it.
Thanks
 
 
-----Original message-----
> From:Magnús Skúlason <[email protected]>
> Sent: Wed 20-Jun-2012 18:36
> To: [email protected]
> Subject: robots.txt, disallow: with empty string
> 
> Hi,
> 
> I have noticed that my nutch crawler skips many sites with robots.txt
> files that look something like this:
> User-agent: *
> Disallow: /administrator/
> Disallow: /classes/
> Disallow: /components/
> Disallow: /editor/
> Disallow: /images/
> Disallow: /includes/
> Disallow: /language/
> Disallow: /mambots/
> Disallow: /media/
> Disallow: /modules/
> Disallow: /templates/
> Disallow: /uploadfiles/
> Disallow:
> 
> That is, the last line is just "Disallow:". Is Nutch treating this as
> though it should disallow all paths? I really don't think that is what
> the webmasters intend; it is probably an auto-generation error in some
> web systems. If their intention were to ban all crawlers, it would be
> more straightforward to put "Disallow: /" and skip all the prior
> rules.
> 
> best regards,
> Magnus
> 

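For context, the usual robots.txt convention (now codified in RFC 9309) is that an empty Disallow value matches no URLs at all, i.e. it allows everything, whereas "Disallow: /" blocks everything. A minimal illustrative sketch of that rule (hypothetical helper names, not Nutch's actual parser):

```python
# Sketch of the robots.txt empty-Disallow convention (illustrative only,
# not Nutch's real implementation).
def parse_disallows(robots_txt):
    """Collect Disallow path prefixes from a robots.txt body."""
    prefixes = []
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith('disallow:'):
            path = line[len('disallow:'):].strip()
            if path:  # an empty value matches nothing, so it bans nothing
                prefixes.append(path)
    return prefixes

def is_allowed(url_path, prefixes):
    """A path is allowed unless it starts with some disallowed prefix."""
    return not any(url_path.startswith(p) for p in prefixes)

rules = parse_disallows("User-agent: *\nDisallow: /images/\nDisallow:\n")
print(is_allowed("/index.html", rules))    # empty Disallow bans nothing
print(is_allowed("/images/a.png", rules))  # blocked by "/images/"
```

Under this reading, the robots.txt quoted above should only block the listed directories, and a crawler that treats the trailing empty "Disallow:" as "Disallow: /" is misparsing it.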