Re: Ignoring robots.txt [was Re: wget default behavior...]
> Tony Godshall wrote:
> >> ... Perhaps it should be one of those things that one can do
> >> oneself if one must but is generally frowned upon (like making a
> >> version of wget that ignores robots.txt).
> >
> > Damn. I was only joking about ignoring robots.txt, but now I'm
> > thinking[1] there may be good reasons to do so... maybe it should be
> > in mainline wget.
>
> Actually, it is. "-e robots=off". :)
>
> This also turns off obedience to the "nofollow" attribute sometimes
> found in meta and a tags.

Ah, my ignorance is showing. I stand corrected.
Re: Ignoring robots.txt [was Re: wget default behavior...]
Tony Godshall wrote:
>> ... Perhaps it should be one of those things that one can do
>> oneself if one must but is generally frowned upon (like making a
>> version of wget that ignores robots.txt).
>
> Damn. I was only joking about ignoring robots.txt, but now I'm
> thinking[1] there may be good reasons to do so... maybe it should be
> in mainline wget.

Actually, it is. "-e robots=off". :)

This also turns off obedience to the "nofollow" attribute sometimes
found in meta and a tags.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
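[Editor's note: a sketch of what is being toggled may help. A robots-obeying crawler consults the site's robots.txt rules before each fetch, and "-e robots=off" simply makes wget skip that check. Illustrative Python using the standard-library parser; the robots.txt content and URLs are made up, and this is not wget's actual implementation:]

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt that blocks everything under /private/.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

def allowed(url, agent="Wget/1.10", obey_robots=True):
    """The per-URL gate a polite crawler applies; robots=off skips it."""
    return rp.can_fetch(agent, url) if obey_robots else True

print(allowed("http://example.com/private/x.html"))                     # False
print(allowed("http://example.com/private/x.html", obey_robots=False))  # True
```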
Re: Man pages [Re: ignoring robots.txt]
Christopher G. Lewis wrote:
> Micah et al. -
>
> Just for an FYI - the whole texi->info, texi->html and (texi->rtf->hlp)
> process is *very* fragile in the Windows world. You actually have to
> download a *very* old version of makeinfo (1.68, not even available on
> http://www.gnu.org/software/texinfo/) that supports RTF generation.
>
> Any progress that we make on this should look at a new texi->hlp (or
> chm) process or abandon the HLP format completely.
>
> The HLP format is kind of nice since you don't get one large HTML
> file, and it has searching etc. But I believe there are issues w/ HLP
> files on either x64 or Vista (can't recall off the top of my head). So
> if it has to go away, so be it.

Perhaps this is a good argument for using DocBook as the source format,
as I believe openjade supports conversion to RTF.

--
Micah J. Cowan
RE: Man pages [Re: ignoring robots.txt]
Micah et al. -

Just for an FYI - the whole texi->info, texi->html and (texi->rtf->hlp)
process is *very* fragile in the Windows world. You actually have to
download a *very* old version of makeinfo (1.68, not even available on
http://www.gnu.org/software/texinfo/) that supports RTF generation.

Any progress that we make on this should look at a new texi->hlp (or
chm) process or abandon the HLP format completely.

The HLP format is kind of nice since you don't get one large HTML file,
and it has searching etc. But I believe there are issues w/ HLP files on
either x64 or Vista (can't recall off the top of my head). So if it has
to go away, so be it.

Christopher G. Lewis
http://www.ChristopherLewis.com

> -Original Message-
> From: Micah Cowan [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 19, 2007 1:16 PM
> To: WGET@sunsite.dk
> Subject: Man pages [Re: ignoring robots.txt]
>
> Daniel Stenberg wrote:
> > On Wed, 18 Jul 2007, Micah Cowan wrote:
> >
> >> The manpage doesn't need to give as detailed explanations as the
> >> info manual (though, as it's auto-generated from the info manual,
> >> this could be hard to avoid); but it should fully describe
> >> essential features.
> >
> > I know GNU projects for some reason go with info, but I'm not a fan
> > of that.
> >
> > Personally I always just use man pages and only resort to using info
> > pages when forced. I simply don't like it when projects "hide"
> > information in info pages.
>
> Well, the original intention, I think, is that the GNU operating
> system would use info as its primary documentation system, and avoid
> man altogether. However, since in reality people just used GNU
> programs on their own preexisting operating systems, which used
> nroff/man as their primary documentation system, it was useful to
> provide man pages as well. (AIUI.)
>
> Info is, IMO, a superior format to manpages (but only because that's
> really not saying much). However, my fingers still type "man wget"
> rather than "info wget" much more readily, for two reasons: (1)
> because only GNU programs tend to use Texinfo, whereas practically
> everything (including GNU software) uses man pages, so it's far more
> ubiquitous/habit-forming, and (2) I'm usually looking for a quick
> reference, not an easy-reading manual: I'm pulling man up to type
> "/something-or-other", which, for me, is easier on an all-in-one
> reference page than in a separated-by-node info manual.
>
> However, when I'm actually looking to read up on a _subject_, rather
> than an option or rc command, I'll use the Texinfo manual, since
> that's what it's better-suited for.
>
> Regardless of personal or group feelings about info, though, I pretty
> much have to have documentation in Texinfo format, as it's expected of
> GNU projects. However, Texinfo doesn't need to be the _source_ format;
> and this discussion makes me toy with the prospect of switching to
> DocBook XML. But I'm not sure I want to be rewriting the manual at
> this point. :p
>
> --
> Micah J. Cowan
> http://micah.cowan.name/
Man pages [Re: ignoring robots.txt]
Daniel Stenberg wrote:
> On Wed, 18 Jul 2007, Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the info
>> manual (though, as it's auto-generated from the info manual, this
>> could be hard to avoid); but it should fully describe essential
>> features.
>
> I know GNU projects for some reason go with info, but I'm not a fan of
> that.
>
> Personally I always just use man pages and only resort to using info
> pages when forced. I simply don't like it when projects "hide"
> information in info pages.

Well, the original intention, I think, is that the GNU operating system
would use info as its primary documentation system, and avoid man
altogether. However, since in reality people just used GNU programs on
their own preexisting operating systems, which used nroff/man as their
primary documentation system, it was useful to provide man pages as
well. (AIUI.)

Info is, IMO, a superior format to manpages (but only because that's
really not saying much). However, my fingers still type "man wget"
rather than "info wget" much more readily, for two reasons: (1) because
only GNU programs tend to use Texinfo, whereas practically everything
(including GNU software) uses man pages, so it's far more
ubiquitous/habit-forming, and (2) I'm usually looking for a quick
reference, not an easy-reading manual: I'm pulling man up to type
"/something-or-other", which, for me, is easier on an all-in-one
reference page than in a separated-by-node info manual.

However, when I'm actually looking to read up on a _subject_, rather
than an option or rc command, I'll use the Texinfo manual, since that's
what it's better-suited for.

Regardless of personal or group feelings about info, though, I pretty
much have to have documentation in Texinfo format, as it's expected of
GNU projects. However, Texinfo doesn't need to be the _source_ format;
and this discussion makes me toy with the prospect of switching to
DocBook XML. But I'm not sure I want to be rewriting the manual at this
point. :p

--
Micah J. Cowan
Re: ignoring robots.txt
Daniel Stenberg wrote:
> On Wed, 18 Jul 2007, Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the info
>> manual (though, as it's auto-generated from the info manual, this
>> could be hard to avoid); but it should fully describe essential
>> features.
>
> I know GNU projects for some reason go with info, but I'm not a fan of
> that.
>
> Personally I always just use man pages and only resort to using info
> pages when forced. I simply don't like it when projects "hide"
> information in info pages.

I've been a FreeBSD user for a couple of years now, and reading this
thread is actually the first time I've stumbled upon 'info'.

--
Andreas
Re: ignoring robots.txt
On Wed, 18 Jul 2007, Micah Cowan wrote:

> The manpage doesn't need to give as detailed explanations as the info
> manual (though, as it's auto-generated from the info manual, this
> could be hard to avoid); but it should fully describe essential
> features.

I know GNU projects for some reason go with info, but I'm not a fan of
that.

Personally I always just use man pages and only resort to using info
pages when forced. I simply don't like it when projects "hide"
information in info pages.
Re: Man pages [Re: ignoring robots.txt]
Micah Cowan <[EMAIL PROTECTED]> writes:

>> Converting from Info to man is harder than it may seem. The script
>> that does it now is basically a hack that doesn't really work well
>> even for the small part of the manual that it tries to cover.
>
> I'd noticed. :)
>
> I haven't looked at the script that does this work; I had assumed
> that it was some standard tool for this task, but perhaps it's
> something more custom?

Our `texi2pod' comes from GCC, and it would seem that it was written
for GCC/binutils. The version in the latest binutils is almost
identical to what Wget ships, plus a bug fix or two. Given its state
and capabilities, I doubt that it is widely used, so unfortunately it's
pretty far from being a standard tool. I would have much preferred to
use a standard tool, but as far as I knew none was available at the
time. In fact, I'm not aware of one now, either.

>> As for the "stub" man page... Debian for one finds it unacceptable,
>> and I can kind of understand why.
>
> Yeah, especially since they're frequently forced to leave out the
> "authoritative" manual.

This issue predates the GFDL debacle by several years, but yes, if
anything, things have gotten worse in that department. Much worse.
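[Editor's note: the mechanical part of a texi2pod-style conversion can be seen in miniature below. This is a hypothetical Python toy, not the actual texi2pod Perl script: it pulls `@item` entries out of a Texinfo options @table and emits man-page `.TP` paragraphs, which is roughly the easy, extractable portion; tutorial nodes, sections, and license text have no such direct mapping.]

```python
import re

def texi_options_to_man(texi: str) -> str:
    """Convert the @item entries of a Texinfo @table into man-page .TP
    paragraphs. A toy illustration of texi2pod-style conversion only."""
    def un_texi(s: str) -> str:
        # Strip simple one-argument Texinfo commands: @code{x} -> x
        return re.sub(r"@\w+\{([^}]*)\}", r"\1", s)

    out = []
    for block in re.split(r"^@item\s+", texi, flags=re.M)[1:]:
        lines = [l for l in block.strip().splitlines()
                 if not l.startswith(("@end", "@table"))]
        name, desc = lines[0], " ".join(lines[1:]).strip()
        out.append(".TP\n\\fB%s\\fR\n%s" % (un_texi(name), un_texi(desc)))
    return "\n".join(out)

texi = """@table @samp
@item --wait=@var{seconds}
Wait the specified number of seconds between retrievals.
@item -e @var{command}
Execute @var{command} as if it were part of @file{.wgetrc}.
@end table
"""
print(texi_options_to_man(texi))
```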
Man pages [Re: ignoring robots.txt]
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> I think we should either be a "stub", or a fairly complete "manual"
>> (and agree that the latter seems preferable); nothing half-way
>> between: what we have now is a fairly incomplete manual.
>
> Converting from Info to man is harder than it may seem. The script
> that does it now is basically a hack that doesn't really work well
> even for the small part of the manual that it tries to cover.

I'd noticed. :)

I haven't looked at the script that does this work; I had assumed that
it was some standard tool for this task, but perhaps it's something
more custom?

> What makes it harder is the impedance mismatch between Texinfo and
> Unix manual philosophies. What is appropriate for a GNU manual, for
> example tutorial-style nodes, a longish FAQ section, or the inclusion
> of the entire software license, would be completely out of place in a
> man page. (This is a consequence of Info being hyperlinked, which
> means that it's easier to skip the nodes one is not interested in, at
> least in theory.) On the other hand, information crucial to any man
> page, such as clearly delimited sections that include SYNOPSIS,
> DESCRIPTION, FILES or SEE ALSO, might not be found in a Texinfo
> document at all, at least not in an easily recognizable and
> extractable form.

Right; by "complete manual", I didn't mean to include such things as
FAQ sections, etc. But yes, it means that one can't simply directly
translate Texinfo docs into an exact equivalent in *roff.

> As for the "stub" man page... Debian for one finds it unacceptable,
> and I can kind of understand why.

Yeah, especially since they're frequently forced to leave out the
"authoritative" manual.

> When the Debian maintainer stepped down, I agreed with his successor
> to a compromise solution: that a man page would be automatically
> generated from the Info documentation which would contain at least a
> fairly complete list of command-line options. It was far from
> perfect, but it was still better than nothing, and it was deemed Good
> Enough. Note that I'm not saying the current solution is good enough
> -- it isn't. I'm just providing a history of how the current state of
> affairs came to be.

And thanks very much for that; it has been very informative.

--
Micah J. Cowan
Re: ignoring robots.txt
Micah Cowan <[EMAIL PROTECTED]> writes:

> I think we should either be a "stub", or a fairly complete "manual"
> (and agree that the latter seems preferable); nothing half-way
> between: what we have now is a fairly incomplete manual.

Converting from Info to man is harder than it may seem. The script
that does it now is basically a hack that doesn't really work well
even for the small part of the manual that it tries to cover.

What makes it harder is the impedance mismatch between Texinfo and
Unix manual philosophies. What is appropriate for a GNU manual, for
example tutorial-style nodes, a longish FAQ section, or the inclusion
of the entire software license, would be completely out of place in a
man page. (This is a consequence of Info being hyperlinked, which
means that it's easier to skip the nodes one is not interested in, at
least in theory.) On the other hand, information crucial to any man
page, such as clearly delimited sections that include SYNOPSIS,
DESCRIPTION, FILES or SEE ALSO, might not be found in a Texinfo
document at all, at least not in an easily recognizable and
extractable form.

As for the "stub" man page... Debian for one finds it unacceptable,
and I can kind of understand why. When I pulled the man page out of
the distribution, Debian's solution was to keep maintaining the old
man page and distributing it with their package. As a result, any
Debian user who issued `man wget' would read the Debian-maintained man
page and was at the mercy of the Debian maintainer to have ensured
that the man page was updated as new features arrived. Since most Unix
users only read the man page and never bother with Info, this was
suboptimal -- a crucial piece of documentation was not inherited from
the project, but produced by Debian. (I further didn't like that the
maintainer used my original man page even though I explicitly asked
them not to, but that's another matter.)

When the Debian maintainer stepped down, I agreed with his successor
to a compromise solution: that a man page would be automatically
generated from the Info documentation which would contain at least a
fairly complete list of command-line options. It was far from perfect,
but it was still better than nothing, and it was deemed Good Enough.
Note that I'm not saying the current solution is good enough -- it
isn't. I'm just providing a history of how the current state of
affairs came to be.
Re: ignoring robots.txt
Tony Lewis wrote:
> Micah Cowan wrote:
>
>> Don't we already follow typical etiquette by default? Or do you
>> mean that to override non-default settings in the rcfile or
>> whatnot?
>
> We don't automatically use a --wait time between requests. I'm not
> sure what other "nice" options we'd want to make easily available,
> but there are probably more.

I suppose either of --wait or --limit-rate would be useful; but if we
decide that something is "nice", we should probably do it by default,
as very, very few users will invoke them explicitly, just to be
"nice".

--
Micah J. Cowan
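[Editor's note: the effect of --wait is simply a pause between successive requests during a recursive retrieval. A minimal sketch in hypothetical Python, with `fetch` standing in for whatever performs one request:]

```python
import time

def polite_fetch_all(urls, fetch, wait=1.0):
    """Fetch each URL in turn, sleeping `wait` seconds between
    requests -- the behavior wget's --wait option adds to a recursive
    download. `fetch` is a hypothetical callback, not a wget API."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no pause before the very first request
            time.sleep(wait)
        results.append(fetch(url))
    return results

# Example with a stub "fetch" that just records the URL:
seen = []
polite_fetch_all(["/a", "/b", "/c"], seen.append, wait=0.01)
print(seen)  # ['/a', '/b', '/c']
```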
RE: ignoring robots.txt
Micah Cowan wrote:
> Don't we already follow typical etiquette by default? Or do you mean
> that to override non-default settings in the rcfile or whatnot?

We don't automatically use a --wait time between requests. I'm not
sure what other "nice" options we'd want to make easily available, but
there are probably more.

Tony
Re: ignoring robots.txt
Tony Lewis wrote:
> Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the
>> info manual (though, as it's auto-generated from the info manual,
>> this could be hard to avoid); but it should fully describe
>> essential features.
>
> I can't see any good reason for one set of documentation to be
> different than another. Let the user choose whatever is comfortable.
> Some users may not even know they have a choice between man and info.

It's mentioned in the manpage, though not nearly as strongly as is
typical for GNU projects. GNU often has "stub" manpages, which say
something along the lines of:

    The full documentation for foo is maintained as a Texinfo manual...

and describe how to invoke info.

If, for some reason, we were to decide that we shouldn't have all the
same info in the manpage as exists in the info manual, we should at
least be calling out the fact that much more information, including a
variety of very useful rc commands, is detailed in the info document.

I think we should either be a "stub", or a fairly complete "manual"
(and agree that the latter seems preferable); nothing half-way
between: what we have now is a fairly incomplete manual.

>> While we're on the subject: should we explicitly warn about using
>> such features as robots=off, and --user-agent? And what should
>> those warnings be? Something like, "Use of this feature may help
>> you download files from which wget would otherwise be blocked, but
>> it's kind of sneaky, and web site administrators may get upset and
>> block your IP address if they discover you using it"?
>
> No, I don't think we should, nor do I think use of those features is
> "sneaky".
>
> With regard to robots.txt, people use it when they don't want
> *automated* spiders crawling through their sites. A well-crafted
> wget command that downloads selected information from a site without
> regard to the robots.txt restrictions is a very different situation.
> It's true that someone could --mirror the site while ignoring
> robots.txt, but even that is legitimate in many cases.
>
> With regard to user agent, many websites customize their output based
> on the browser that is displaying the page. If one does not set the
> user agent to match their browser, the retrieved content may be very
> different than what was displayed in the browser.

Yes, but I meant with specific intent to get around website
restrictions. Certain sites (image galleries, for instance) often
specifically want to force users to access their resources via the
web, and do not wish to allow users to mass-download their resources
for later offline perusal: they want to force the users to come back
each time to use them--especially if the site requires a subscription
of some sort (e.g., porn), or their ad revenue is directly tied to
some "Top 100" list and they want to force you to vote (warez, roms).
Of course, if you're downloading warez, the concept of "sneaky" wget
options probably doesn't concern you overly much! :)

Whether getting around such restrictions with --user-agent and -e
robots=off is "sneaky" is debatable, when you're legitimately
accessing content that you could straightforwardly obtain with your
web browser (where "legitimate" in this context probably means you're
subscribed to the image gallery, or own the physical counterparts to
the roms you're downloading, or what have you), but you'll almost
certainly be banned from the site if you happen to be discovered.

Perhaps this is more FAQ territory than manual territory at any rate:
I was thinking of crafting a FAQ entry for dealing with such issues
(nofollow, especially, seems to trip users up), but I want to craft it
carefully, with ample warnings about what you're doing. :)

> All that being said, it wouldn't hurt to have a section in the
> documentation on wget etiquette: think carefully about ignoring
> robots.txt, use --wait to throttle the download if it will be
> lengthy, etc.

I think that may be wise.

> Perhaps we can even add a --be-nice option similar to --mirror that
> adjusts options to match the etiquette suggestions.

Don't we already follow typical etiquette by default? Or do you mean
that to override non-default settings in the rcfile or whatnot?

--
Micah J. Cowan
RE: ignoring robots.txt
Micah Cowan wrote:
> The manpage doesn't need to give as detailed explanations as the info
> manual (though, as it's auto-generated from the info manual, this
> could be hard to avoid); but it should fully describe essential
> features.

I can't see any good reason for one set of documentation to be
different than another. Let the user choose whatever is comfortable.
Some users may not even know they have a choice between man and info.

> While we're on the subject: should we explicitly warn about using
> such features as robots=off, and --user-agent? And what should those
> warnings be? Something like, "Use of this feature may help you
> download files from which wget would otherwise be blocked, but it's
> kind of sneaky, and web site administrators may get upset and block
> your IP address if they discover you using it"?

No, I don't think we should, nor do I think use of those features is
"sneaky".

With regard to robots.txt, people use it when they don't want
*automated* spiders crawling through their sites. A well-crafted wget
command that downloads selected information from a site without regard
to the robots.txt restrictions is a very different situation. It's
true that someone could --mirror the site while ignoring robots.txt,
but even that is legitimate in many cases.

With regard to user agent, many websites customize their output based
on the browser that is displaying the page. If one does not set the
user agent to match their browser, the retrieved content may be very
different than what was displayed in the browser.

All that being said, it wouldn't hurt to have a section in the
documentation on wget etiquette: think carefully about ignoring
robots.txt, use --wait to throttle the download if it will be lengthy,
etc. Perhaps we can even add a --be-nice option similar to --mirror
that adjusts options to match the etiquette suggestions.

Tony
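[Editor's note: the content negotiation Tony describes is driven entirely by the User-Agent request header, and overriding it, as wget's --user-agent option does, just replaces that one header. A minimal illustration with Python's urllib; the URL and agent string here are made up:]

```python
from urllib.request import Request

# Build a request that masquerades as a browser instead of sending the
# library's default agent string (analogous to wget's --user-agent).
req = Request(
    "http://example.com/gallery/",  # hypothetical URL
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
print(req.get_header("User-agent"))  # Mozilla/5.0 (X11; Linux x86_64)
```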
Re: ignoring robots.txt
Steven M. Schweda wrote:
> From: Josh Williams
>
>> As far as I can tell, there's nothing in the man page about it.
>
> It's pretty well hidden.
>
>    -e robots=off
>
> At this point, I normally just grind my teeth instead of complaining
> about the differences between the command-line options and the
> commands in the ".wgetrc" start-up file.

The man page, AFAICT, doesn't list any of the rc commands, or anything
much at all about the rc file. This should probably be remedied, as
this information is important enough, and people often are in the
habit of typing "man wget" instead of "info wget", to say nothing of
those distributions which might distribute the info manual in a
separate package, due to, say, DFSG issues (which are resolved at the
moment in the trunk).

The manpage doesn't need to give as detailed explanations as the info
manual (though, as it's auto-generated from the info manual, this
could be hard to avoid); but it should fully describe essential
features.

While we're on the subject: should we explicitly warn about using such
features as robots=off, and --user-agent? And what should those
warnings be? Something like, "Use of this feature may help you
download files from which wget would otherwise be blocked, but it's
kind of sneaky, and web site administrators may get upset and block
your IP address if they discover you using it"?

--
Micah J. Cowan
Re: ignoring robots.txt
From: Josh Williams
> As far as I can tell, there's nothing in the man page about it.

It's pretty well hidden.

   -e robots=off

At this point, I normally just grind my teeth instead of complaining
about the differences between the command-line options and the
commands in the ".wgetrc" start-up file.

Steven M. Schweda                               [EMAIL PROTECTED]
382 South Warwick Street                        (+1) 651-699-9818
Saint Paul MN 55105-2547
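[Editor's note: the command-line/.wgetrc split Steven mentions works like this: `-e` injects a single rc command from the command line, so, per wget's documented rc syntax, the two forms below should be equivalent:]

```
# On the command line:
#   wget -e robots=off --recursive http://example.com/
# Or, as a line in ~/.wgetrc:
robots = off
```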
Re: ignoring robots.txt
On 7/18/07, Maciej W. Rozycki <[EMAIL PROTECTED]> wrote:
> There is no particular reason, so we do.

As far as I can tell, there's nothing in the man page about it.
Re: ignoring robots.txt
On Wed, 18 Jul 2007, Josh Williams wrote:

> Is there any particular reason we don't have an option to ignore
> robots.txt?

There is no particular reason, so we do.

  Maciej