Micah Cowan wrote:

> The manpage doesn't need to give as detailed explanations as the info
> manual (though, as it's auto-generated from the info manual, this could
> be hard to avoid); but it should fully describe essential features.

I can't see any good reason for one set of documentation to differ from 
another. Let the user choose whichever is comfortable. Some users may not even 
know they have a choice between man and info.

> While we're on the subject: should we explicitly warn about using such
> features as robots=off, and --user-agent? And what should those warnings
> be? Something like, "Use of this feature may help you download files
> from which wget would otherwise be blocked, but it's kind of sneaky, and
> web site administrators may get upset and block your IP address if they
> discover you using it"?

No, I don't think we should, nor do I think use of those features is "sneaky".

With regard to robots.txt, people use it when they don't want *automated* 
spiders crawling through their sites. A well-crafted wget command that 
downloads selected information from a site without regard to the robots.txt 
restrictions is a very different situation. It's true that someone could 
--mirror the site while ignoring robots.txt, but even that is legitimate in 
many cases.

With regard to user agent, many websites customize their output based on the 
browser that is displaying the page. If you do not set the user agent to match 
your browser, the retrieved content may be very different from what was 
displayed in the browser.
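To make that concrete, here is a sketch of such an invocation. The URL and the user-agent string are placeholders, not recommendations; substitute whatever your own browser actually sends:

```shell
# Fetch a page while presenting a browser-like User-Agent, so the
# server returns the same content it would serve to that browser.
# Both the UA string and the URL below are illustrative placeholders.
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" \
     https://example.com/page.html
```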

All that being said, it wouldn't hurt to have a section in the documentation on 
wget etiquette: think carefully before ignoring robots.txt, use --wait to 
throttle the download if it will be lengthy, and so on.

Perhaps we can even add a --be-nice option similar to --mirror that adjusts 
options to match the etiquette suggestions.
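To sketch what such an option might expand to (--be-nice is only a proposal here, and the values below are illustrative, not tuned recommendations):

```shell
# A "polite" mirror: pause between requests, randomize the pause,
# cap bandwidth, and leave robots.txt handling at its default (on).
# The delay and rate values are examples only.
wget --mirror \
     --wait=2 --random-wait \
     --limit-rate=100k \
     https://example.com/
```

All of the options shown (--mirror, --wait, --random-wait, --limit-rate) already exist in wget; --be-nice would simply bundle a set of defaults like these.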

Tony