Re: correct processing of redirections

2003-11-26 Thread Hrvoje Niksic
Peter Kohts writes:

> 4) When I'm doing straight-forward wget -m -nH http://www.gnu.org;
> everything is excellent, except the redirections: the files which we
> get because of the redirections overwrite any currently existing
> files with the same filenames.

I see your point.  Redirections to other hosts are indeed somewhat
evil, and I'm becoming convinced that the way that Wget handles them
now is suboptimal.  Fixing this correctly will require some thinking,
but a short-term workaround might be to provide an option to ignore
redirections to other hosts.


correct processing of redirections

2003-11-25 Thread Peter Kohts
Hi there.

Let me explain the problem:

1) I'm trying to prepare for being a mirror of www.gnu.org
(which is not the most shameful thing to do, I suppose).

2) I'm somewhat devoted to wget and do not want to use
other software.

3) There are some redirects at www.gnu.org to other hosts
like savannah.gnu.org, gnuhh.org, etc.

4) When I'm doing straight-forward wget -m -nH http://www.gnu.org;
everything is excellent, except the redirections: the files which we
get because of the redirections overwrite any currently existing
files with the same filenames.

Example:
Let's imagine that wget has downloaded some part of www.gnu.org,
then (of course) it has downloaded the first file (or maybe second,
if robots.txt goes first): index.html (which is
http://www.gnu.org/index.html). Now when wget comes across the
http://www.gnu.org/people/greve/greve.html it gets a 302 (moved) to
http://gnuhh.org/. Now it goes right there and downloads index.html,
which immediately overwrites index.html downloaded from
http://www.gnu.org/index.html.
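
The collision in this example can be reproduced with a small model of how
URLs are mapped to local file names when host directories are disabled
(-nH). This is only a sketch: local_name() is a hypothetical helper and
not wget's actual naming code, and the "index.html" default for directory
URLs is taken from the example above.

```python
from urllib.parse import urlsplit
import posixpath


def local_name(url, no_host_dirs=True):
    """Simplified, hypothetical model of mapping a URL to a local file name."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # A path ending in "/" stands for a directory index.
    if path.endswith("/"):
        path += "index.html"
    path = path.lstrip("/")
    if no_host_dirs:
        # Without a host directory, only the path distinguishes files.
        return path
    return posixpath.join(parts.netloc, path)


saved = {}  # local file name -> URL it was last written from
for url in ["http://www.gnu.org/index.html", "http://gnuhh.org/"]:
    name = local_name(url)
    if name in saved:
        print(f"{url} overwrites {name} (previously from {saved[name]})")
    saved[name] = url
```

With no_host_dirs=False the two URLs map to different names
(www.gnu.org/index.html and gnuhh.org/index.html), so the overwrite only
happens when the host part is dropped.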


I'd suggest that wget process redirections as ordinary links: just
add them to the processing queue and move on, and do not
download them without first checking them with
download_child_p().
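
A minimal sketch of that idea, written in Python rather than wget's C so
that it stays short. Everything here is an assumption made for
illustration: FAKE_WEB stands in for the network, should_download() plays
the role of download_child_p() and only compares host names, and mirror()
is a toy breadth-first crawler, not wget's recursion code.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical stand-in for the network:
# URL -> ("page", [links found in it]) or ("redirect", target URL).
FAKE_WEB = {
    "http://www.gnu.org/": ("page", ["http://www.gnu.org/people/greve/greve.html"]),
    "http://www.gnu.org/people/greve/greve.html": ("redirect", "http://gnuhh.org/"),
    "http://gnuhh.org/": ("page", []),
}


def should_download(url, start_host):
    """Stand-in for download_child_p(); here it only checks the host (assumption)."""
    return urlsplit(url).netloc == start_host


def mirror(start_url):
    start_host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    downloaded, skipped = [], []
    while queue:
        url = queue.popleft()
        if not should_download(url, start_host):
            skipped.append(url)
            continue
        kind, payload = FAKE_WEB[url]
        if kind == "page":
            downloaded.append(url)
            targets = payload
        else:
            # A redirect target is queued like any other link instead of
            # being fetched immediately.
            targets = [payload]
        for target in targets:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return downloaded, skipped


downloaded, skipped = mirror("http://www.gnu.org/")
print("downloaded:", downloaded)
print("skipped:", skipped)
```

In this toy run the redirect to http://gnuhh.org/ is queued, rejected by
the host check, and never written to disk, so the index.html from
www.gnu.org is left alone.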

This approach works well if you're mirroring a site, but
might not be the expected behaviour when you're downloading
just one page: the page won't be downloaded if it's redirected
to another host. So the second situation needs different
processing rules.
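
One way to express the two situations as a single rule is sketched below.
The function name follow_redirect() and its "recursive" parameter are
hypothetical, and the same-host test is a deliberately crude assumption
standing in for whatever checks a real mirror run would apply.

```python
from urllib.parse import urlsplit


def follow_redirect(requested_url, target_url, recursive):
    """Hypothetical policy: decide whether a redirect should be followed."""
    if not recursive:
        # Single-page download: the user asked for this document,
        # so fetch it wherever it has moved to.
        return True
    # Recursive or mirror run: only stay on the host that was requested.
    same_host = urlsplit(requested_url).netloc == urlsplit(target_url).netloc
    return same_host


src = "http://www.gnu.org/people/greve/greve.html"
dst = "http://gnuhh.org/"
print(follow_redirect(src, dst, recursive=False))
print(follow_redirect(src, dst, recursive=True))
```

The first call models fetching just that one page and the second models a
mirror run, where the cross-host redirect is refused.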


That's it. Share your opinions, please (especially, Hrvoje,
since you're the maintainer :-)

Peter.