-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

It appears that some people (including myself) are confused by the fact
that wget will download files that match a rejection pattern (or fail to
match an accept pattern), if the file type is text/html.

The manual says:

  "Note that these two options do not affect the downloading of HTML
files; Wget must load all the HTMLs to know where to go at
all--recursive retrieval would make no sense otherwise."

That might potentially apply to brain-dead uses such as -Rhtml, but what
about -R '*cgi-bin*' and the like?

One user (Frank Lui, Cc'd) recently submitted a bug report/complaint
that despite a reject list of
"*\?rev*,*\?sortcol*,*\?raw*,*\?skin*,*\?template*", wget was
downloading tons of files whose URIs differed only in their use of such
parameters. I believe Frank was just trying to mirror a wiki, fetching
only the current version of each page (so, not downloading other
revisions, sorting variants, different actions, etc.).

This is a very reasonable and common sort of expectation, and AFAICT
there is no way to accomplish this with current wget. This seems like a
problem.

Is there any real reason that we can't just always reject files if they
match the reject list? Or, would it be worth adding an extra option to
allow even HTML files to be skipped?
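
To make the two alternatives concrete, here is a rough sketch of the
decision rule in question. It is only an illustration, not wget's actual
code: the function names and the always_reject switch are hypothetical,
and it models nothing beyond the "HTML is exempt from the reject list"
behavior the manual describes. Python's fnmatch does not treat a
backslash as an escape, so "[?]" stands in for the "\?" in Frank's
patterns.

```python
from fnmatch import fnmatchcase


def rejected(name, reject_patterns):
    """Return True if name matches any shell-style reject pattern."""
    return any(fnmatchcase(name, p) for p in reject_patterns)


def should_download(name, is_html, reject_patterns, always_reject=False):
    """Hypothetical decision rule (an assumption, not wget's source).

    always_reject=False models the documented behavior: HTML is
    fetched regardless of the reject list.
    always_reject=True models the proposed change: the reject list
    applies to HTML as well.
    """
    if is_html and not always_reject:
        return True
    return not rejected(name, reject_patterns)


# "[?]" matches a literal question mark in fnmatch patterns.
patterns = ["*[?]rev*", "*[?]sortcol*"]
page = "WebHome?rev=3"  # made-up wiki URL fragment

print(should_download(page, True, patterns))
print(should_download(page, True, patterns, always_reject=True))
```

With the default, the old-revision page is fetched even though it
matches the reject list; with always_reject=True it is skipped, which is
what someone mirroring only the current pages would expect.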

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGzM017M8hyUobTrERCIXJAJ4gsfxbogYnr+jKS6a4scKh8TmG1QCeIra0
hBZ/w0LaiSftI0R3nSbwlfQ=
=928c
-----END PGP SIGNATURE-----
