Ignoring robots.txt [was Re: wget default behavior...]

2007-10-17 Thread Tony Godshall
... Perhaps it should be one of those things that one can do oneself if one must but is generally frowned upon (like making a version of wget that ignores robots.txt). Damn. I was only joking about ignoring robots.txt, but now I'm thinking[1] there may be good reasons to do so... maybe

Re: Ignoring robots.txt [was Re: wget default behavior...]

2007-10-17 Thread Micah Cowan
Tony Godshall wrote: ... Perhaps it should be one of those things that one can do oneself if one must but is generally frowned upon (like making a version of wget that ignores robots.txt). Damn. I was only joking about ignoring robots.txt

Re: Ignoring robots.txt [was Re: wget default behavior...]

2007-10-17 Thread Tony Godshall
Tony Godshall wrote: ... Perhaps it should be one of those things that one can do oneself if one must but is generally frowned upon (like making a version of wget that ignores robots.txt). Damn. I was only joking about ignoring robots.txt, but now I'm thinking[1] there may be good

Re: Man pages [Re: ignoring robots.txt]

2007-07-20 Thread Micah Cowan
Christopher G. Lewis wrote: Micah et al. - Just for an FYI - the whole texi-info, texi-html and (texi-rtf-hlp) is *very* fragile in the windows world. You actually have to download a *very* old version of makeinfo (1.68, not even on

Re: ignoring robots.txt

2007-07-19 Thread Daniel Stenberg
On Wed, 18 Jul 2007, Micah Cowan wrote: The manpage doesn't need to give as detailed explanations as the info manual (though, as it's auto-generated from the info manual, this could be hard to avoid); but it should fully describe essential features. I know GNU projects for some reason go

Re: ignoring robots.txt

2007-07-19 Thread Andreas Pettersson
Daniel Stenberg wrote: On Wed, 18 Jul 2007, Micah Cowan wrote: The manpage doesn't need to give as detailed explanations as the info manual (though, as it's auto-generated from the info manual, this could be hard to avoid); but it should fully describe essential features. I know GNU

Man pages [Re: ignoring robots.txt]

2007-07-19 Thread Micah Cowan
Daniel Stenberg wrote: On Wed, 18 Jul 2007, Micah Cowan wrote: The manpage doesn't need to give as detailed explanations as the info manual (though, as it's auto-generated from the info manual, this could be hard to avoid); but it should fully

RE: Man pages [Re: ignoring robots.txt]

2007-07-19 Thread Christopher G. Lewis
recall off the top of my head). So if it has to go away, so be it. Christopher G. Lewis http://www.ChristopherLewis.com -----Original Message----- From: Micah Cowan Sent: Thursday, July 19, 2007 1:16 PM To: WGET@sunsite.dk Subject: Man pages [Re: ignoring robots.txt

Re: ignoring robots.txt

2007-07-18 Thread Maciej W. Rozycki
On Wed, 18 Jul 2007, Josh Williams wrote: Is there any particular reason we don't have an option to ignore robots.txt? There is no particular reason, so we do. Maciej

Re: ignoring robots.txt

2007-07-18 Thread Josh Williams
On 7/18/07, Maciej W. Rozycki wrote: There is no particular reason, so we do. As far as I can tell, there's nothing in the man page about it.

Re: ignoring robots.txt

2007-07-18 Thread Steven M. Schweda
From: Josh Williams As far as I can tell, there's nothing in the man page about it. It's pretty well hidden. -e robots=off At this point, I normally just grind my teeth instead of complaining about the differences between the command-line options and the commands in the .wgetrc
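For readers looking for the setting mentioned above, here is a minimal sketch of the two equivalent ways to turn off robots.txt handling: as a .wgetrc command, or passed on the command line with -e, which executes a .wgetrc-style command. It also sets the --wait delay that comes up elsewhere in this thread. The URL and the two-second delay are placeholders chosen for illustration, not values from the thread.

```
# ~/.wgetrc fragment (assumed location of the user's rc file)
robots = off    # do not fetch or obey robots.txt
wait = 2        # placeholder: pause 2 seconds between retrievals

# Equivalent one-off command line (example.com is a placeholder URL):
#   wget -e robots=off --wait=2 --recursive https://example.com/docs/
```

Putting the setting in the rc file applies it to every run, so the command-line form is probably the safer choice when robots.txt should be ignored only for a single, deliberate download.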

Re: ignoring robots.txt

2007-07-18 Thread Micah Cowan
Steven M. Schweda wrote: From: Josh Williams As far as I can tell, there's nothing in the man page about it. It's pretty well hidden. -e robots=off At this point, I normally just grind my teeth instead of complaining about the

RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
could --mirror the site while ignoring robots.txt, but even that is legitimate in many cases. With regard to user agent, many websites customize their output based on the browser that is displaying the page. If one does not set user agent to match their browser, the retrieved content may be very

Re: ignoring robots.txt

2007-07-18 Thread Micah Cowan
* spiders crawling through their sites. A well-crafted wget command that downloads selected information from a site without regard to the robots.txt restrictions is a very different situation. It's true that someone could --mirror the site while ignoring robots.txt, but even that is legitimate in many

RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote: Don't we already follow typical etiquette by default? Or do you mean that to override non-default settings in the rcfile or whatnot? We don't automatically use a --wait time between requests. I'm not sure what other nice options we'd want to make easily available, but there

Re: ignoring robots.txt

2007-07-18 Thread Micah Cowan
Tony Lewis wrote: Micah Cowan wrote: Don't we already follow typical etiquette by default? Or do you mean that to override non-default settings in the rcfile or whatnot? We don't automatically use a --wait time between requests. I'm not

Re: ignoring robots.txt

2007-07-18 Thread Hrvoje Niksic
Micah Cowan writes: I think we should either be a stub, or a fairly complete manual (and agree that the latter seems preferable); nothing half-way between: what we have now is a fairly incomplete manual. Converting from Info to man is harder than it may seem. The script

Man pages [Re: ignoring robots.txt]

2007-07-18 Thread Micah Cowan
Hrvoje Niksic wrote: Micah Cowan writes: I think we should either be a stub, or a fairly complete manual (and agree that the latter seems preferable); nothing half-way between: what we have now is a fairly incomplete manual.

Re: Man pages [Re: ignoring robots.txt]

2007-07-18 Thread Hrvoje Niksic
Micah Cowan writes: Converting from Info to man is harder than it may seem. The script that does it now is basically a hack that doesn't really work well even for the small part of the manual that it tries to cover. I'd noticed. :) I haven't looked at the script that