Hi,

I have noticed that my Nutch crawler skips many sites whose robots.txt
files look something like this:
User-agent: *
Disallow: /administrator/
Disallow: /classes/
Disallow: /components/
Disallow: /editor/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /uploadfiles/
Disallow:

Note the last line, which is just "Disallow:". Is Nutch treating this as
if it should disallow all paths? I really don't think that is what the
webmasters intend; it is probably an auto-generation error in some web
systems. If their intention were to ban all crawlers, it would be more
straightforward to put a single "Disallow: /" and skip all the prior
rules.
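
For what it's worth, my reading of the robots exclusion standard is that
an empty "Disallow:" value matches nothing, i.e. it allows everything.
Here is a minimal sketch of that interpretation (this is not Nutch's
actual parser, just an illustration of how I'd expect the rules above to
be applied):

    import java.util.List;

    public class RobotsCheck {

        /** Returns true if the path is allowed under the given Disallow prefixes. */
        static boolean isAllowed(String path, List<String> disallowPrefixes) {
            for (String prefix : disallowPrefixes) {
                if (prefix.isEmpty()) {
                    continue; // an empty Disallow: value disallows nothing
                }
                if (path.startsWith(prefix)) {
                    return false; // path falls under a disallowed prefix
                }
            }
            return true;
        }

        public static void main(String[] args) {
            List<String> rules = List.of(
                    "/administrator/", "/classes/", "/components/", "/editor/",
                    "/images/", "/includes/", "/language/", "/mambots/",
                    "/media/", "/modules/", "/templates/", "/uploadfiles/",
                    ""); // the trailing empty Disallow: from the example above
            System.out.println(isAllowed("/index.html", rules));      // expected: true
            System.out.println(isAllowed("/images/logo.png", rules)); // expected: false
        }
    }

Under that reading, "/index.html" should still be crawlable, and only the
explicitly listed directories should be skipped.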

best regards,
Magnus
