Hi there.

Let me explain the problem:

1) I'm preparing to run a mirror of www.gnu.org
(which is not the most shameful thing to do, I suppose).

2) I'm somewhat devoted to wget and do not want to use
other software.

3) There are some redirects at www.gnu.org to other hosts,
like savannah.gnu.org, gnuhh.org, etc.

4) When I do a straightforward "wget -m -nH http://www.gnu.org",
everything works fine except for the redirections: the files we
get by following redirections overwrite any currently existing
files with the same filenames.

Example:
Suppose wget has already downloaded part of www.gnu.org, which
means it has downloaded the very first file (or maybe the second,
if robots.txt comes first): index.html (that is,
http://www.gnu.org/index.html). When wget later comes across
http://www.gnu.org/people/greve/greve.html, it gets a 302 (Moved)
to http://gnuhh.org/. It follows the redirect right away and
downloads that index.html, which immediately overwrites the
index.html downloaded from http://www.gnu.org/index.html.
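
In terms of on-disk names, this happens because -nH
(--no-host-directories) strips the hostname from the local path,
so both documents map to the same file:

    http://www.gnu.org/index.html  ->  index.html
    http://gnuhh.org/              ->  index.html   (overwrites the first)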


I'd suggest that wget treat redirections like ordinary links:
just add them to the processing queue and forget about them,
and never download them without first checking them with
download_child_p(). A rough sketch follows.
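
To make the idea concrete, here is a minimal sketch of how a
redirect handler could hand the target back to the recursion
engine. Only download_child_p() is taken from wget's recur.c;
queue_push() and the parameter names are hypothetical stand-ins
for whatever the real internals use:

    /* Sketch only: treat a redirect target like a link discovered
       on the parent page.  queue_push() is a hypothetical helper;
       download_child_p() is the real acceptance check in recur.c.  */
    static void
    handle_redirect (struct urlpos *redir, struct url *parent, int depth,
                     struct url *start_url, struct hash_table *blacklist)
    {
      /* Run the same accept/reject, host and directory checks that
         every other discovered link goes through.  */
      if (download_child_p (redir, parent, depth, start_url, blacklist))
        queue_push (redir);   /* enqueue it; download later, in turn */
      /* otherwise: drop it, exactly as rejected links are dropped */
    }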

This approach works well when you're mirroring a site, but it
might not be the desired behaviour when you're downloading just
one page: that page won't be downloaded at all if it's redirected
to another host. So the single-page case needs different
processing rules, along the lines of the sketch below.
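
Again as a hedged sketch (opt.recursive is a real wget option
flag; everything else here is made up for illustration), the two
rules could be split like this:

    /* Sketch only: different redirect policy for mirroring vs.
       single-page retrieval.  */
    if (opt.recursive)
      {
        /* Mirroring: queue the redirect like any other link and
           let download_child_p() decide whether to follow it.  */
        if (download_child_p (redir, parent, depth, start_url, blacklist))
          queue_push (redir);
      }
    else
      {
        /* Single page: the user explicitly asked for this document,
           so follow the redirect immediately, wherever it leads.  */
        follow_redirect_now (redir);    /* hypothetical helper */
      }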


That's it. Please share your opinions (especially you, Hrvoje,
since you're the maintainer :-)

Peter.
