Hi Hack,
Thanks for getting back to me. I didn't realize that version 1.6
existed; however, it has some of the same problems. I tried it "as is"
and it failed on problem #1 that I identified below. It doesn't really
matter whether the password has the @ in it: any HTTP redirect seems to
throw the password off, even in 1.6.
I also tried to use version 1.6 to mirror a free website (one that needs
no password), but deliberately specified my username (with %40 standing
for '@') on the command line. It failed to retrieve anything from that
website beyond the very first HTML file. Removing the password, of
course, fixed the problem.
It seems to me that wget could use some work, but I am sure that 1.7-dev
is much better and that you've taken care of these problems. Submitting
my patch is probably not a very good idea, since I hacked the 1.5.3 code
to work under Windows 2000 and couldn't do a very good job in 3 hours. I
don't think you want it. But the basic idea is that whenever wget is
redirected with a 301, or follows *any* link, I make sure that the new
link gets the password from the cur_url link before we even try to
follow the new link. For example, suppose page A is password-protected.
Page A has a link to page B (no password there), and page B references
D, which does require the password. Whenever my code follows links, it
keeps the same password across all transitions A-B-D and succeeds in
coming back into the protected area cleanly. Furthermore, site A might
also be reachable under a different DNS name, say X, and unpatched wget
drops the password in that case as well (i.e., A-B-X, or A-X).
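The idea of carrying the password across every transition could be
sketched roughly as follows. This is an illustration in Python, not the
actual C patch: inherit_credentials is a hypothetical helper, and the
only name taken from the message above is cur_url.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def inherit_credentials(cur_url: str, link: str) -> str:
    """Resolve `link` against `cur_url` and copy cur_url's userinfo onto it.

    Hypothetical helper: it mirrors the behaviour described in the message
    (keep the password on every redirect or followed link), not wget's code.
    """
    target = urlsplit(urljoin(cur_url, link))
    cur = urlsplit(cur_url)

    # Split on the LAST '@' so a stray '@' inside the username is harmless.
    userinfo, sep, _host = cur.netloc.rpartition("@")

    # Nothing to inherit, or the new link already carries its own userinfo.
    if not sep or "@" in target.netloc:
        return target.geturl()

    return urlunsplit(target._replace(netloc=f"{userinfo}@{target.netloc}"))


# A 301 from http://site to http://site:80 keeps the password.
print(inherit_credentials("http://user:pw@site/a/index.html",
                          "http://site:80/a/index.html"))
```

Note that this sketch, like the approach described above, forwards the
credentials to every followed link, including links to other hosts; a
production version would probably want to restrict that to hosts the
user trusts.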
The hack around @ is not as clean, but it works in my case (it may not
work in general). I suggest that you decouple the password from the URL.
In wget, both are *always* kept together in a field called url or
something similar. This creates confusion when calling parse_url() and
similar functions. My suggestion: take the password out of the URL at
the very beginning of a session, and keep it separate.
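A minimal sketch of that decoupling, again in Python for illustration
only. Credentials, split_credentials and basic_auth_header are
hypothetical names, not wget functions; the assumption is that the
userinfo is stripped once at startup and only reappears when the
Authorization header for Basic authentication is built.

```python
import base64
from typing import NamedTuple, Optional, Tuple
from urllib.parse import unquote, urlsplit, urlunsplit


class Credentials(NamedTuple):
    user: str
    password: Optional[str]


def split_credentials(url: str) -> Tuple[str, Optional[Credentials]]:
    """Return (url_without_userinfo, credentials_or_None)."""
    parts = urlsplit(url)
    # The last '@' separates userinfo from the host.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    bare_url = urlunsplit(parts._replace(netloc=hostport))
    if not sep:
        return bare_url, None
    user, colon, password = userinfo.partition(":")
    # Decode %40 and friends once, here, and never put them back in the URL.
    return bare_url, Credentials(unquote(user),
                                 unquote(password) if colon else None)


def basic_auth_header(cred: Credentials) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{cred.user}:{cred.password or ''}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


url, cred = split_credentials("http://me%40mail.example:secret@www.site.com/main")
print(url)   # the URL that all later parsing and link matching would see
print(cred)  # kept separately for the whole session
```

With the credentials held apart, URL parsing and host comparison only
ever see the bare URL, so an '@' in the username can no longer confuse
them.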
Thanks
Dmitri
Hack Kampbjørn wrote:
Please try the latest wget version 1.6 or, even better, try the CVS
development version (1.7-dev). Take a look at http://sunsite.dk/wget
for instructions on how to get it.
Some work has been done on improving wget's handling of passwords,
specifically the handling of '@' in passwords. But if not all of your
cases have been addressed, consider submitting your patch. The website
also says how the wget development team prefers to receive such patches
(diff -u against the CVS source).
Dmitri Loguinov wrote:
Hi
I am sure you're aware of the fact that wget 1.5.3 does not properly
handle password-protected HTTP sites (even with Basic authentication).
There are several places in the code where the username/password are
silently "dropped", and wget tries to access the same site with no
password. Furthermore, matters were complicated because my username
contained the character '@'. Handling of the character was OK in
retrieving the first page (because it was encoded as %40), but upon
redirection and in the other cases described below, the password was
dropped because the code is written sloppily.
1. HTTP code 301 -- page permanently moved. The site I worked with
always redirected every page to http://site:80 and would not accept
http://site. Therefore, upon redirection, it's important to keep the
password, which wget does not do.
2. The same site referenced itself with fully qualified URLs. For
example, instead of saying href = "main.html" it would say href =
"http://site/directory/main.html". Wget would lose the password in that
case as well. Furthermore, wget would think that the URL belonged to a
*different* site and would not take the link if the -L (i.e., local
files only) option was specified. This was apparently because cur_url
contained the password but the href did not (again, some patching was
needed to skip the first @, which was part of my username).
3. If the username contains @ (such as an email address), then after a
few iterations of the main code, the %40 would eventually get replaced
by @, and on later searches for the site name the code would stop at
the first @ instead of the second one, which separates the password
from the website. Consider this URL:
'[EMAIL PROTECTED]@www.site.com/main' -- once the %40 is expanded to the
first @, the code would NOT convert it back to %40 as required by one of
the RFCs.
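Points 2 and 3 can be illustrated with a short Python sketch (not
wget's C code). All three helpers are hypothetical names. The
assumptions are that the host is located by splitting on the LAST '@',
that the username and password are percent-encoded again whenever a URL
is rebuilt, and that two URLs count as the same site when scheme, host
and effective port match, whatever credentials they carry.

```python
from typing import Optional, Tuple
from urllib.parse import quote, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def split_authority(authority: str) -> Tuple[Optional[str], str]:
    """Split 'userinfo@host:port' on the last '@'.

    Searching for the first '@' breaks as soon as the username itself
    contains one (e.g. an email address whose %40 was already decoded).
    """
    userinfo, sep, hostport = authority.rpartition("@")
    return (userinfo if sep else None), hostport


def encode_userinfo(user: str, password: str) -> str:
    """Percent-encode both fields so '@' goes back to %40 in a rebuilt URL."""
    # safe="" makes quote() escape '@', ':' and '/' as well.
    return f"{quote(user, safe='')}:{quote(password, safe='')}"


def same_site(url_a: str, url_b: str) -> bool:
    """Compare scheme, host and effective port, ignoring any userinfo."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    port_a = a.port or DEFAULT_PORTS.get(a.scheme)
    port_b = b.port or DEFAULT_PORTS.get(b.scheme)
    return (a.scheme, a.hostname, port_a) == (b.scheme, b.hostname, port_b)


print(split_authority("me@mail.example:pw@www.site.com"))
print(encode_userinfo("me@mail.example", "pw"))
print(same_site("http://me%40mail.example:pw@site/a/",
                "http://site:80/directory/main.html"))
```

Comparing only scheme, host and port means that a fully qualified href
without a password is still recognised as belonging to the site in
cur_url, and that http://site and http://site:80 are treated as the
same place.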
It took me about 3 hours to patch the code, but I am not sure what other
functionality I might have disabled or affected. To tell the truth, it
is quite annoying that simple things like these were not thought of by
whoever wrote the code. Anyhow, thanks for writing it. :)
Dmitri
--
Med venlig hilsen / Kind regards
Hack Kampbjørn [EMAIL PROTECTED]
HackLine +45 2031 7799