Linda Walsh <[EMAIL PROTECTED]> writes:
[...]

To answer the question raised in the subject: obviously, respecting
the "robots" file does not imply (even jokingly) that Wget's operator
is a robot, but that the program is an automated agent, a.k.a. a
crawler, which, once set up, analyzes HTML and downloads content
without the user's explicit intervention.  For example, a browser
requires you to click on each link to retrieve it, whereas Wget will
download any number of links without asking about each one.

In many circles, respecting robots.txt is what distinguishes polite
crawlers from impolite ones.  The server's benefit is the ability to
hide parts of the site from search engines, or to keep crawlers away
from dynamically generated, CPU-intensive parts of the site.  The
crawler's benefit is that respecting robots.txt can keep it from being
banned from the server, and can also keep it out of the URL "black
holes" present on many sites with dynamic content and URL generation.
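As a sketch of what honoring robots.txt involves, here is the same
check done with Python's standard urllib.robotparser rather than
Wget's own implementation (the site and the rules are made up for
illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that hides a CPU-intensive CGI area
# from all crawlers (example rules, not from any real site).
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching it:
print(rp.can_fetch("Wget/1.0", "http://example.com/cgi-bin/search"))  # False
print(rp.can_fetch("Wget/1.0", "http://example.com/index.html"))      # True
```

A crawler that consults can_fetch() before every retrieval simply
skips the disallowed URLs, which is all that "respecting robots.txt"
amounts to in practice.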

It is debatable how much of the above applies to Wget, but those
arguments were the original reasons for honoring robots.txt by
default.  Today I tend to view Wget not so much as a real crawler,
but more as a user-initiated download agent.  For example, a browser
downloads all images, frames, and style sheets referenced by a web
page simply because the user clicked on a link or typed the URL into
the location bar.  It could be argued that Wget has the same "right"
to download, without consulting robots.txt, whatever content its
operator instructed it to download.
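For what it's worth, Wget already leaves this choice to the operator;
a sketch of the two command lines (the URL is made up):

```shell
# Default: recursive retrieval consults robots.txt first.
wget -r http://example.com/

# The operator explicitly turns robots processing off for this run:
wget -r -e robots=off http://example.com/
```

The -e option executes a .wgetrc-style command, so "robots = off" can
equally be set permanently in the user's .wgetrc.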

> Regardless of wget being invoked by a human or an automated script,
> do you think that it might be a "desirable" feature to have wget
> display the reason why it isn't downloading files

I for one agree that this would be desirable.
