Re: Ignoring robots.txt [was Re: wget default behavior...]
> Tony Godshall wrote:
> >> ... Perhaps it should be one of those things that one can do
> >> oneself if one must but is generally frowned upon (like making a
> >> version of wget that ignores robots.txt).
> >
> > Damn. I was only joking about ignoring robots.txt, but now I'm
> > thinking[1] there may be good reasons to do so... maybe it should be
> > in mainline wget.
>
> Actually, it is. "-e robots=off". :)
>
> This also turns off obedience to the "nofollow" attribute sometimes
> found in meta and a tags.

Ah, my ignorance is showing. I stand corrected.
Re: Ignoring robots.txt [was Re: wget default behavior...]
Tony Godshall wrote:
>> ... Perhaps it should be one of those things that one can do
>> oneself if one must but is generally frowned upon (like making a
>> version of wget that ignores robots.txt).
>
> Damn. I was only joking about ignoring robots.txt, but now I'm
> thinking[1] there may be good reasons to do so... maybe it should be
> in mainline wget.

Actually, it is. "-e robots=off". :)

This also turns off obedience to the "nofollow" attribute sometimes
found in meta and a tags.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
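[Editor's note: a sketch of what is being toggled may help. A robots-obeying crawler consults the site's robots.txt rules before each fetch, and "-e robots=off" simply makes wget skip that check. Illustrative Python using the standard-library parser; the robots.txt content and URLs are made up, and this is not wget's actual implementation:]

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt that blocks everything under /private/.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

def allowed(url, agent="Wget/1.10", obey_robots=True):
    """The per-URL gate a polite crawler applies; robots=off skips it."""
    return rp.can_fetch(agent, url) if obey_robots else True

print(allowed("http://example.com/private/x.html"))                     # False
print(allowed("http://example.com/private/x.html", obey_robots=False))  # True
```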
Re: Man pages [Re: ignoring robots.txt]
Christopher G. Lewis wrote:
> Micah et al. -
>
> Just for an FYI - the whole texi->info, texi->html and (texi->rtf->hlp)
> process is *very* fragile in the Windows world. You actually have to
> download a *very* old version of makeinfo (1.68, not even available on
> http://www.gnu.org/software/texinfo/) that supports RTF generation.
>
> Any progress that we make on this should look at a new texi->hlp (or
> chm) process or abandon the HLP format completely.
>
> The HLP format is kind of nice since you don't get one large HTML
> file, and it has searching etc. But I believe there are issues w/ HLP
> files on either x64 or Vista (can't recall off the top of my head). So
> if it has to go away, so be it.

Perhaps this is a good argument for using DocBook as the source format,
as I believe openjade supports conversion to RTF.

--
Micah J. Cowan
RE: Man pages [Re: ignoring robots.txt]
Micah et al. -

Just for an FYI - the whole texi->info, texi->html and (texi->rtf->hlp)
process is *very* fragile in the Windows world. You actually have to
download a *very* old version of makeinfo (1.68, not even available on
http://www.gnu.org/software/texinfo/) that supports RTF generation.

Any progress that we make on this should look at a new texi->hlp (or
chm) process or abandon the HLP format completely.

The HLP format is kind of nice since you don't get one large HTML file,
and it has searching etc. But I believe there are issues w/ HLP files on
either x64 or Vista (can't recall off the top of my head). So if it has
to go away, so be it.

Christopher G. Lewis
http://www.ChristopherLewis.com

> -Original Message-
> From: Micah Cowan [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 19, 2007 1:16 PM
> To: WGET@sunsite.dk
> Subject: Man pages [Re: ignoring robots.txt]
>
> Daniel Stenberg wrote:
> > On Wed, 18 Jul 2007, Micah Cowan wrote:
> >
> >> The manpage doesn't need to give as detailed explanations as the
> >> info manual (though, as it's auto-generated from the info manual,
> >> this could be hard to avoid); but it should fully describe
> >> essential features.
> >
> > I know GNU projects for some reason go with info, but I'm not a fan
> > of that.
> >
> > Personally I always just use man pages and only resort to using info
> > pages when forced. I simply don't like it when projects "hide"
> > information in info pages.
>
> Well, the original intention, I think, is that the GNU operating
> system would use info as its primary documentation system, and avoid
> man altogether. However, since in reality people just used GNU
> programs on their own preexisting operating systems, which used
> nroff/man as their primary documentation system, it was useful to
> provide man pages as well. (AIUI.)
>
> Info is, IMO, a superior format to manpages (but only because that's
> really not saying much). However, my fingers still type "man wget"
> rather than "info wget" much more readily, for two reasons: (1)
> because only GNU programs tend to use Texinfo, whereas practically
> everything (including GNU software) uses man pages, so it's far more
> ubiquitous/habit-forming, and (2) I'm usually looking for a quick
> reference, not an easy-reading manual: I'm pulling man up to type
> "/something-or-other", which, for me, is easier on an all-in-one
> reference page than in a separated-by-node info manual.
>
> However, when I'm actually looking to read up on a _subject_, rather
> than an option or rc command, I'll use the Texinfo manual, since
> that's what it's better-suited for.
>
> Regardless of personal or group feelings about info, though, I pretty
> much have to have documentation in Texinfo format, as it's expected of
> GNU projects. However, Texinfo doesn't need to be the _source_ format;
> and this discussion makes me toy with the prospect of switching to
> DocBook XML. But I'm not sure I want to be rewriting the manual at
> this point. :p
>
> --
> Micah J. Cowan
> http://micah.cowan.name/
Man pages [Re: ignoring robots.txt]
Daniel Stenberg wrote:
> On Wed, 18 Jul 2007, Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the info
>> manual (though, as it's auto-generated from the info manual, this
>> could be hard to avoid); but it should fully describe essential
>> features.
>
> I know GNU projects for some reason go with info, but I'm not a fan of
> that.
>
> Personally I always just use man pages and only resort to using info
> pages when forced. I simply don't like it when projects "hide"
> information in info pages.

Well, the original intention, I think, is that the GNU operating system
would use info as its primary documentation system, and avoid man
altogether. However, since in reality people just used GNU programs on
their own preexisting operating systems, which used nroff/man as their
primary documentation system, it was useful to provide man pages as
well. (AIUI.)

Info is, IMO, a superior format to manpages (but only because that's
really not saying much). However, my fingers still type "man wget"
rather than "info wget" much more readily, for two reasons: (1) because
only GNU programs tend to use Texinfo, whereas practically everything
(including GNU software) uses man pages, so it's far more
ubiquitous/habit-forming, and (2) I'm usually looking for a quick
reference, not an easy-reading manual: I'm pulling man up to type
"/something-or-other", which, for me, is easier on an all-in-one
reference page than in a separated-by-node info manual.

However, when I'm actually looking to read up on a _subject_, rather
than an option or rc command, I'll use the Texinfo manual, since that's
what it's better-suited for.

Regardless of personal or group feelings about info, though, I pretty
much have to have documentation in Texinfo format, as it's expected of
GNU projects. However, Texinfo doesn't need to be the _source_ format;
and this discussion makes me toy with the prospect of switching to
DocBook XML. But I'm not sure I want to be rewriting the manual at this
point. :p

--
Micah J. Cowan
Re: ignoring robots.txt
Daniel Stenberg wrote:
> On Wed, 18 Jul 2007, Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the info
>> manual (though, as it's auto-generated from the info manual, this
>> could be hard to avoid); but it should fully describe essential
>> features.
>
> I know GNU projects for some reason go with info, but I'm not a fan of
> that.
>
> Personally I always just use man pages and only resort to using info
> pages when forced. I simply don't like it when projects "hide"
> information in info pages.

I've been a FreeBSD user for a couple of years now, and reading this
thread is actually the first time I've stumbled upon 'info'.

--
Andreas
Re: ignoring robots.txt
On Wed, 18 Jul 2007, Micah Cowan wrote:

> The manpage doesn't need to give as detailed explanations as the info
> manual (though, as it's auto-generated from the info manual, this
> could be hard to avoid); but it should fully describe essential
> features.

I know GNU projects for some reason go with info, but I'm not a fan of
that.

Personally I always just use man pages and only resort to using info
pages when forced. I simply don't like it when projects "hide"
information in info pages.
Re: Man pages [Re: ignoring robots.txt]
Micah Cowan <[EMAIL PROTECTED]> writes:

>> Converting from Info to man is harder than it may seem. The script
>> that does it now is basically a hack that doesn't really work well
>> even for the small part of the manual that it tries to cover.
>
> I'd noticed. :)
>
> I haven't looked at the script that does this work; I had assumed
> that it was some standard tool for this task, but perhaps it's
> something more custom?

Our `texi2pod' comes from GCC, and it would seem that it was written
for GCC/binutils. The version in the latest binutils is almost
identical to what Wget ships, plus a bug fix or two. Given its state
and capabilities, I doubt that it is widely used, so unfortunately it's
pretty far from being a standard tool. I would have much preferred to
use a standard tool, but as far as I knew none was available at the
time. In fact, I'm not aware of one now, either.

>> As for the "stub" man page... Debian for one finds it unacceptable,
>> and I can kind of understand why.
>
> Yeah, especially since they're frequently forced to leave out the
> "authoritative" manual.

This issue predates the GFDL debacle by several years, but yes, if
anything, things have gotten worse in that department. Much worse.
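[Editor's note: the mechanical part of a texi2pod-style conversion can be seen in miniature below. This is a hypothetical Python toy, not the actual texi2pod Perl script: it pulls `@item` entries out of a Texinfo options @table and emits man-page `.TP` paragraphs, which is roughly the easy, extractable portion; tutorial nodes, sections, and license text have no such direct mapping.]

```python
import re

def texi_options_to_man(texi: str) -> str:
    """Convert the @item entries of a Texinfo @table into man-page .TP
    paragraphs. A toy illustration of texi2pod-style conversion only."""
    def un_texi(s: str) -> str:
        # Strip simple one-argument Texinfo commands: @code{x} -> x
        return re.sub(r"@\w+\{([^}]*)\}", r"\1", s)

    out = []
    for block in re.split(r"^@item\s+", texi, flags=re.M)[1:]:
        lines = [l for l in block.strip().splitlines()
                 if not l.startswith(("@end", "@table"))]
        name, desc = lines[0], " ".join(lines[1:]).strip()
        out.append(".TP\n\\fB%s\\fR\n%s" % (un_texi(name), un_texi(desc)))
    return "\n".join(out)

texi = """@table @samp
@item --wait=@var{seconds}
Wait the specified number of seconds between retrievals.
@item -e @var{command}
Execute @var{command} as if it were part of @file{.wgetrc}.
@end table
"""
print(texi_options_to_man(texi))
```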
Man pages [Re: ignoring robots.txt]
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> I think we should either be a "stub", or a fairly complete "manual"
>> (and agree that the latter seems preferable); nothing half-way
>> between: what we have now is a fairly incomplete manual.
>
> Converting from Info to man is harder than it may seem. The script
> that does it now is basically a hack that doesn't really work well
> even for the small part of the manual that it tries to cover.

I'd noticed. :)

I haven't looked at the script that does this work; I had assumed that
it was some standard tool for this task, but perhaps it's something
more custom?

> What makes it harder is the impedance mismatch between Texinfo and
> Unix manual philosophies. What is appropriate for a GNU manual, for
> example tutorial-style nodes, a longish FAQ section, or the inclusion
> of the entire software license, would be completely out of place in a
> man page. (This is a consequence of Info being hyperlinked, which
> means that it's easier to skip the nodes one is not interested in, at
> least in theory.) On the other hand, information crucial to any man
> page, such as clearly delimited sections that include SYNOPSIS,
> DESCRIPTION, FILES or SEE ALSO, might not be found in a Texinfo
> document at all, at least not in an easily recognizable and
> extractable form.

Right; by "complete manual", I didn't mean to include such things as
FAQ sections, etc. But yes, it means that one can't simply directly
translate Texinfo docs into an exact equivalent in *roff.

> As for the "stub" man page... Debian for one finds it unacceptable,
> and I can kind of understand why.

Yeah, especially since they're frequently forced to leave out the
"authoritative" manual.

> When the Debian maintainer stepped down, I agreed with his successor
> to a compromise solution: that a man page would be automatically
> generated from the Info documentation which would contain at least a
> fairly complete list of command-line options. It was far from
> perfect, but it was still better than nothing, and it was deemed Good
> Enough. Note that I'm not saying the current solution is good enough
> -- it isn't. I'm just providing a history of how the current state of
> affairs came to be.

And thanks very much for that; it has been very informative.

--
Micah J. Cowan
Re: ignoring robots.txt
Micah Cowan <[EMAIL PROTECTED]> writes:

> I think we should either be a "stub", or a fairly complete "manual"
> (and agree that the latter seems preferable); nothing half-way
> between: what we have now is a fairly incomplete manual.

Converting from Info to man is harder than it may seem. The script
that does it now is basically a hack that doesn't really work well
even for the small part of the manual that it tries to cover.

What makes it harder is the impedance mismatch between Texinfo and
Unix manual philosophies. What is appropriate for a GNU manual, for
example tutorial-style nodes, a longish FAQ section, or the inclusion
of the entire software license, would be completely out of place in a
man page. (This is a consequence of Info being hyperlinked, which
means that it's easier to skip the nodes one is not interested in, at
least in theory.) On the other hand, information crucial to any man
page, such as clearly delimited sections that include SYNOPSIS,
DESCRIPTION, FILES or SEE ALSO, might not be found in a Texinfo
document at all, at least not in an easily recognizable and
extractable form.

As for the "stub" man page... Debian for one finds it unacceptable,
and I can kind of understand why. When I pulled the man page out of
the distribution, Debian's solution was to keep maintaining the old
man page and distributing it with their package. As a result, any
Debian user who issued `man wget' would read the Debian-maintained man
page and was at the mercy of the Debian maintainer to have ensured
that the man page was updated as new features arrived. Since most Unix
users only read the man page and never bother with Info, this was
suboptimal -- a crucial piece of documentation was not inherited from
the project, but produced by Debian. (I further didn't like that the
maintainer used my original man page even though I explicitly asked
them not to, but that's another matter.)

When the Debian maintainer stepped down, I agreed with his successor
to a compromise solution: that a man page would be automatically
generated from the Info documentation which would contain at least a
fairly complete list of command-line options. It was far from perfect,
but it was still better than nothing, and it was deemed Good Enough.
Note that I'm not saying the current solution is good enough -- it
isn't. I'm just providing a history of how the current state of
affairs came to be.
Re: ignoring robots.txt
Tony Lewis wrote:
> Micah Cowan wrote:
>
>> Don't we already follow typical etiquette by default? Or do you
>> mean that to override non-default settings in the rcfile or
>> whatnot?
>
> We don't automatically use a --wait time between requests. I'm not
> sure what other "nice" options we'd want to make easily available,
> but there are probably more.

I suppose either of --wait or --limit-rate would be useful; but if we
decide that something is "nice", we should probably do it by default,
as very, very few users will invoke them explicitly, just to be
"nice".

--
Micah J. Cowan
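[Editor's note: the effect of --wait is simply a pause between successive requests during a recursive retrieval. A minimal sketch in hypothetical Python, with `fetch` standing in for whatever performs one request:]

```python
import time

def polite_fetch_all(urls, fetch, wait=1.0):
    """Fetch each URL in turn, sleeping `wait` seconds between
    requests -- the behavior wget's --wait option adds to a recursive
    download. `fetch` is a hypothetical callback, not a wget API."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no pause before the very first request
            time.sleep(wait)
        results.append(fetch(url))
    return results

# Example with a stub "fetch" that just records the URL:
seen = []
polite_fetch_all(["/a", "/b", "/c"], seen.append, wait=0.01)
print(seen)  # ['/a', '/b', '/c']
```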
RE: ignoring robots.txt
Micah Cowan wrote:
> Don't we already follow typical etiquette by default? Or do you mean
> that to override non-default settings in the rcfile or whatnot?

We don't automatically use a --wait time between requests. I'm not
sure what other "nice" options we'd want to make easily available, but
there are probably more.

Tony
Re: ignoring robots.txt
Tony Lewis wrote:
> Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the
>> info manual (though, as it's auto-generated from the info manual,
>> this could be hard to avoid); but it should fully describe
>> essential features.
>
> I can't see any good reason for one set of documentation to be
> different than another. Let the user choose whatever is comfortable.
> Some users may not even know they have a choice between man and info.

It's mentioned in the manpage, though not nearly as strongly as is
typical for GNU projects. GNU often has "stub" manpages, which say
something along the lines of:

    The full documentation for foo is maintained as a Texinfo manual...

and describe how to invoke info.

If, for some reason, we were to decide that we shouldn't have all the
same info in the manpage as exists in the info manual, we should at
least be calling out the fact that much more information, including a
variety of very useful rc commands, is detailed in the info document.

I think we should either be a "stub", or a fairly complete "manual"
(and agree that the latter seems preferable); nothing half-way
between: what we have now is a fairly incomplete manual.

>> While we're on the subject: should we explicitly warn about using
>> such features as robots=off, and --user-agent? And what should
>> those warnings be? Something like, "Use of this feature may help
>> you download files from which wget would otherwise be blocked, but
>> it's kind of sneaky, and web site administrators may get upset and
>> block your IP address if they discover you using it"?
>
> No, I don't think we should, nor do I think use of those features is
> "sneaky".
>
> With regard to robots.txt, people use it when they don't want
> *automated* spiders crawling through their sites. A well-crafted
> wget command that downloads selected information from a site without
> regard to the robots.txt restrictions is a very different situation.
> It's true that someone could --mirror the site while ignoring
> robots.txt, but even that is legitimate in many cases.
>
> With regard to user agent, many websites customize their output based
> on the browser that is displaying the page. If one does not set the
> user agent to match their browser, the retrieved content may be very
> different than what was displayed in the browser.

Yes, but I meant with specific intent to get around website
restrictions. Certain sites (image galleries, for instance) often
specifically want to force users to access their resources via the
web, and do not wish to allow users to mass-download their resources
for later offline perusal: they want to force the users to come back
each time to use them--especially if the site requires a subscription
of some sort (e.g., porn), or their ad revenue is directly tied to
some "Top 100" list and they want to force you to vote (warez, roms).
Of course, if you're downloading warez, the concept of "sneaky" wget
options probably doesn't concern you overly much! :)

Whether getting around such restrictions with --user-agent and -e
robots=off is "sneaky" is debatable, when you're legitimately
accessing content that you could straightforwardly obtain with your
web browser (where "legitimate" in this context probably means you're
subscribed to the image gallery, or own the physical counterparts to
the roms you're downloading, or what have you), but you'll almost
certainly be banned from the site if you happen to be discovered.

Perhaps this is more FAQ territory than manual territory at any rate:
I was thinking of crafting a FAQ entry for dealing with such issues
(nofollow, especially, seems to trip users up), but I want to craft it
carefully, with ample warnings about what you're doing. :)

> All that being said, it wouldn't hurt to have a section in the
> documentation on wget etiquette: think carefully about ignoring
> robots.txt, use --wait to throttle the download if it will be
> lengthy, etc.

I think that may be wise.

> Perhaps we can even add a --be-nice option similar to --mirror that
> adjusts options to match the etiquette suggestions.

Don't we already follow typical etiquette by default? Or do you mean
that to override non-default settings in the rcfile or whatnot?

--
Micah J. Cowan
RE: ignoring robots.txt
Micah Cowan wrote:
> The manpage doesn't need to give as detailed explanations as the info
> manual (though, as it's auto-generated from the info manual, this
> could be hard to avoid); but it should fully describe essential
> features.

I can't see any good reason for one set of documentation to be
different than another. Let the user choose whatever is comfortable.
Some users may not even know they have a choice between man and info.

> While we're on the subject: should we explicitly warn about using
> such features as robots=off, and --user-agent? And what should those
> warnings be? Something like, "Use of this feature may help you
> download files from which wget would otherwise be blocked, but it's
> kind of sneaky, and web site administrators may get upset and block
> your IP address if they discover you using it"?

No, I don't think we should, nor do I think use of those features is
"sneaky".

With regard to robots.txt, people use it when they don't want
*automated* spiders crawling through their sites. A well-crafted wget
command that downloads selected information from a site without regard
to the robots.txt restrictions is a very different situation. It's
true that someone could --mirror the site while ignoring robots.txt,
but even that is legitimate in many cases.

With regard to user agent, many websites customize their output based
on the browser that is displaying the page. If one does not set the
user agent to match their browser, the retrieved content may be very
different than what was displayed in the browser.

All that being said, it wouldn't hurt to have a section in the
documentation on wget etiquette: think carefully about ignoring
robots.txt, use --wait to throttle the download if it will be lengthy,
etc. Perhaps we can even add a --be-nice option similar to --mirror
that adjusts options to match the etiquette suggestions.

Tony
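[Editor's note: the content negotiation Tony describes is driven entirely by the User-Agent request header, and overriding it, as wget's --user-agent option does, just replaces that one header. A minimal illustration with Python's urllib; the URL and agent string here are made up:]

```python
from urllib.request import Request

# Build a request that masquerades as a browser instead of sending the
# library's default agent string (analogous to wget's --user-agent).
req = Request(
    "http://example.com/gallery/",  # hypothetical URL
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
print(req.get_header("User-agent"))  # Mozilla/5.0 (X11; Linux x86_64)
```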
Re: ignoring robots.txt
Steven M. Schweda wrote:
> From: Josh Williams
>
>> As far as I can tell, there's nothing in the man page about it.
>
> It's pretty well hidden.
>
>    -e robots=off
>
> At this point, I normally just grind my teeth instead of complaining
> about the differences between the command-line options and the
> commands in the ".wgetrc" start-up file.

The man page, AFAICT, doesn't list any of the rc commands, or anything
much at all about the rc file. This should probably be remedied, as
this information is important enough, and people often are in the
habit of typing "man wget" instead of "info wget", to say nothing of
those distributions which might distribute the info manual in a
separate package, due to, say, DFSG issues (which are resolved at the
moment in the trunk).

The manpage doesn't need to give as detailed explanations as the info
manual (though, as it's auto-generated from the info manual, this
could be hard to avoid); but it should fully describe essential
features.

While we're on the subject: should we explicitly warn about using such
features as robots=off, and --user-agent? And what should those
warnings be? Something like, "Use of this feature may help you
download files from which wget would otherwise be blocked, but it's
kind of sneaky, and web site administrators may get upset and block
your IP address if they discover you using it"?

--
Micah J. Cowan
Re: ignoring robots.txt
From: Josh Williams
> As far as I can tell, there's nothing in the man page about it.

It's pretty well hidden.

   -e robots=off

At this point, I normally just grind my teeth instead of complaining
about the differences between the command-line options and the
commands in the ".wgetrc" start-up file.

Steven M. Schweda                               [EMAIL PROTECTED]
382 South Warwick Street                        (+1) 651-699-9818
Saint Paul MN 55105-2547
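[Editor's note: the command-line/.wgetrc split Steven mentions works like this: `-e` injects a single rc command from the command line, so, per wget's documented rc syntax, the two forms below should be equivalent:]

```
# On the command line:
#   wget -e robots=off --recursive http://example.com/
# Or, as a line in ~/.wgetrc:
robots = off
```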
Re: ignoring robots.txt
On 7/18/07, Maciej W. Rozycki <[EMAIL PROTECTED]> wrote:
> There is no particular reason, so we do.

As far as I can tell, there's nothing in the man page about it.
Re: ignoring robots.txt
On Wed, 18 Jul 2007, Josh Williams wrote:

> Is there any particular reason we don't have an option to ignore
> robots.txt?

There is no particular reason, so we do.

  Maciej