correct processing of redirections

2003-11-25 Thread Peter Kohts
Hi there.

Let me explain the problem:

1) I'm preparing to run a mirror of www.gnu.org
(which is hardly a shameful thing to do, I suppose).

2) I'm somewhat devoted to wget and do not want to use
other software.

3) There are some redirects at www.gnu.org to other hosts,
like savannah.gnu.org, gnuhh.org, etc.

4) When I do a straightforward wget -m -nH http://www.gnu.org,
everything is excellent except for the redirections: the files we
get because of the redirections overwrite any already existing
files with the same filenames.

Example:
Let's imagine that wget has downloaded some part of www.gnu.org;
of course it has already fetched the first file (or maybe the
second, if robots.txt goes first): index.html (which is
http://www.gnu.org/index.html). Now when wget comes across
http://www.gnu.org/people/greve/greve.html it gets a 302 (Moved)
to http://gnuhh.org/. It goes right there and downloads index.html,
which immediately overwrites the index.html downloaded from
http://www.gnu.org/index.html.


I'd suggest that wget process redirections like ordinary links:
just add them to the processing queue and forget about them, and
do not download them without first checking them with
download_child_p().

This approach works well if you're mirroring a whole site, but it
might not be the desired behaviour when you're downloading just
one page: the page won't be downloaded at all if it is redirected
to another host. So that second situation needs different
processing rules.
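
Just to make the idea concrete, here is a rough, self-contained
sketch of what I mean (this is NOT actual wget source; url_host(),
accept_child() and handle_redirect() are made-up stand-ins, with
accept_child() playing the role that download_child_p() already
plays for ordinary links, reduced here to a bare same-host test):

  /*
   * Rough sketch (NOT real wget code): treat a redirect target like
   * any other discovered link and run it through the same acceptance
   * test instead of fetching it unconditionally.
   */
  #include <stdio.h>
  #include <string.h>

  /* Copy the host part of "http://host/path" into buf (toy parser). */
  static void url_host(const char *url, char *buf, size_t len)
  {
      const char *p = strstr(url, "://");
      size_t i = 0;

      p = p ? p + 3 : url;
      while (p[i] && p[i] != '/' && i < len - 1) {
          buf[i] = p[i];
          i++;
      }
      buf[i] = '\0';
  }

  /* Stand-in for download_child_p(): accept only same-host URLs. */
  static int accept_child(const char *start_url, const char *candidate)
  {
      char h1[256], h2[256];

      url_host(start_url, h1, sizeof h1);
      url_host(candidate, h2, sizeof h2);
      return strcmp(h1, h2) == 0;
  }

  /*
   * What would happen on a 3xx response: during a recursive download
   * the target goes through the same acceptance test as a normal
   * link; for a single-page download (the second situation above) it
   * is still followed.
   */
  static void handle_redirect(const char *start_url, const char *target,
                              int recursing)
  {
      if (!recursing || accept_child(start_url, target))
          printf("queue for download:     %s\n", target);
      else
          printf("drop external redirect: %s\n", target);
  }

  int main(void)
  {
      const char *start = "http://www.gnu.org/";

      handle_redirect(start, "http://www.gnu.org/index.html", 1); /* kept    */
      handle_redirect(start, "http://gnuhh.org/", 1);             /* dropped */
      handle_redirect(start, "http://gnuhh.org/", 0);             /* kept    */
      return 0;
  }

Compiling and running it shows the greve.html redirect to gnuhh.org
being dropped during a recursive download but still followed in the
single-page case.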


That's it. Share your opinions, please (especially you, Hrvoje,
since you're the maintainer :-)

Peter.



Re: need help on a link

2003-11-23 Thread Peter Kohts

QC wget -r --convert-links http://netcity4.web.hinet.net/UserData/tsd04/1122-2.htm -o test.log
QC worked. It downloaded the html and the files associated with it.
QC But
QC wget -r --convert-links http://netcity4.web.hinet.net/UserData/tsd04/1122-3.htm -o test.log
QC only downloaded the html file and nothing else.
QC Can someone tell me how to make wget work with the second link?
QC Thank you very much.

QC Qian


Try wget -H -p -k http://netcity4.web.hinet.net/UserData/tsd04/1122-3.htm -o test.log


Peter.



Fw: Re[2]: follow_ftp not work

2003-11-17 Thread Peter Kohts
 Follow ftp is off by default, so you shouldn't need to set it
 explicitly.
 
 What might have happened in your case is that an http URL redirected
 to ftp, which was followed as a redirection, not as part of the
 recursive download.


Hrvoje,

it looks like it would take much less time to implement something
like --disregard-external-redirects than to explain to everyone
that the feature is not yet available.

If you're up to implementing this, I'd suggest a supplemental
--store-external-redirects option that would create a .htaccess
file listing the external redirects; that would be EXTREMELY
useful for mirroring HUGE sites (like www.gnu.org).
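
Purely as an illustration (the option name is only a proposal, and
the exact format is up for discussion), the stored file could be
nothing more than ordinary Apache mod_alias Redirect lines, e.g.
for the greve.html case I described earlier:

  # written by the proposed --store-external-redirects (hypothetical)
  Redirect temp /people/greve/greve.html http://gnuhh.org/

A mirror served by Apache could then drop that file into its
document root and hand out the same redirects the origin does.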


Luck,
Peter.

PS: This letter was originally sent to Hrvoje personally by
mistake; sorry, Hrvoje.