Hi,

I have noticed that my Nutch crawler skips many sites with robots.txt files that look something like this:

User-agent: *
Disallow: /administrator/
Disallow: /classes/
Disallow: /components/
Disallow: /editor/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /uploadfiles/
Disallow:
That is, where the last line is just "Disallow:" — is Nutch treating this as if it should disallow all paths? I really don't think that is what the webmasters intend; it is probably an auto-generation error in some web systems. If it were their intention to ban all crawlers, it would be more straightforward to put "Disallow: /" and skip all the prior rules.

Best regards,
Magnus
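For comparison, the robots exclusion convention treats an empty "Disallow:" value as "allow everything", and other parsers follow that reading. A minimal sketch with Python's urllib.robotparser (just an illustration of the convention, not Nutch's own Java parser; example.com and the paths are placeholders) shows how an abbreviated version of the file above is interpreted:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the robots.txt in question, including
# the trailing empty "Disallow:" line.
robots_txt = """\
User-agent: *
Disallow: /administrator/
Disallow: /includes/
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The explicit path rules still block their directories...
print(rp.can_fetch("*", "http://example.com/administrator/index.php"))
# ...but the empty "Disallow:" does NOT block everything else.
print(rp.can_fetch("*", "http://example.com/index.html"))
```

Here the first check returns False and the second True, i.e. the empty Disallow line is treated as a no-op rather than a blanket ban. If Nutch skips these sites entirely, its robots.txt handling would seem to diverge from that convention.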

