Linda Walsh <[EMAIL PROTECTED]> writes:

[...]

To answer the question raised in the subject: obviously, respecting the robots.txt file does not imply (even jokingly) that Wget's operator is a robot, but that the program is an automated agent, a.k.a. a crawler, which, once set up, analyzes HTML and downloads content without the user's explicit intervention. For example, a browser requires you to click on a link to retrieve it, whereas Wget will download any number of links without asking about each one.
In many circles, respecting robots.txt is what distinguishes polite crawlers from impolite ones. The server's benefit is the ability to hide parts of the site from search engines, or to keep crawlers away from dynamically generated, CPU-intensive parts of the site. The crawler's benefit is that respecting robots.txt can keep it from being banned from the server, and can also keep it out of the URL "black holes" present on many sites with dynamic content and URL generation.

It can be questioned how much of the above is relevant to Wget, but those arguments were the original reasons for honoring robots.txt by default. Today I tend to view Wget not so much as a real crawler, but more as a user-initiated download agent. For example, a browser downloads all images, frames, and style sheets referenced by a web page simply because the user clicked on a link or typed the URL in the location bar. It could be argued that Wget has the same "right" to download the content its operator instructed it to download, without consulting robots.txt.

> Regardless of wget being invoked by a human or an automated script,
> do you think that it might be a "desirable" feature to have wget
> display the reason why it isn't downloading files

I for one agree that this would be desirable.
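For anyone curious about the mechanics being discussed, the robots.txt check a polite crawler performs can be sketched with Python's standard-library parser. This is only an illustration; the sample rules, the example.com URLs, and the "Wget/1.x" user-agent string are made up for the sketch, not taken from Wget itself:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a server might publish it to keep
# crawlers out of dynamically generated, CPU-intensive areas.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /dynamic/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler that honors robots.txt asks before each fetch:
print(rp.can_fetch("Wget/1.x", "http://example.com/index.html"))     # True
print(rp.can_fetch("Wget/1.x", "http://example.com/cgi-bin/search")) # False
```

The second URL is refused because it falls under a Disallow rule, which is exactly the kind of silent skip the quoted request below asks Wget to explain to the user.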
