Hi Hack,

Thanks for getting back to me. I didn't realize that version 1.6
existed; however, it has some of the same problems. I tried it "as is"
and it failed on problem #1 that I identified below. It doesn't really
matter whether the password has the @ in it; any HTTP redirect seems to
throw the password off, even in 1.6.

I also tried to use version 1.6 to rip a free website, but purposely
specified my username (with %40 standing for '@') on the command line.
It failed to get anything from that website beyond the very first HTML
file. Removing the password of course fixed the problem.

It seems to me that wget could use some work, but I am sure that 1.7-dev
is much better and that you've taken care of these problems. Submitting
my patch is probably not a very good idea, since I hacked the 1.5.3 code
to work under Windows 2000 and couldn't do a very good job in 3 hours. I
don't think you want it. But the basic idea is that whenever wget is
redirected with a 301, or follows *any* link, I make sure that the new
link gets the password from the cur_url link before we even try to
follow the new link. Thus, suppose page A is passworded. Page A has a
link to page B (no password there). However, page B references D, which
does require the password. My code keeps the same password through all
transitions A->B->D whenever it follows links, and so succeeds in coming
back into the protected area cleanly. Furthermore, site A might have a
different DNS name, say X, and stock wget will drop the password in that
case as well (i.e., A->B->X, or A->X).
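
A minimal sketch of that idea, in Python rather than wget's C and not the
actual patch. The helper name inherit_credentials() is hypothetical, and
the host names are placeholders. It assumes the credentials live in the
URL as user:password@host, and it copies them to every followed link
that has none of its own:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def inherit_credentials(cur_url: str, link: str) -> str:
    """Resolve link against cur_url and carry cur_url's login over to it.

    Hypothetical helper: the login is copied only when the resolved link
    does not already contain one.
    """
    target = urlsplit(urljoin(cur_url, link))
    # Split on the LAST '@' so that an unescaped '@' inside the username
    # is not mistaken for the separator in front of the host name.
    userinfo, at, _ = urlsplit(cur_url).netloc.rpartition("@")
    if not at or "@" in target.netloc:
        # Nothing to inherit, or the link already has its own login.
        return urlunsplit(target)
    return urlunsplit(target._replace(netloc=f"{userinfo}@{target.netloc}"))


# A 301 from http://site to http://site:80 keeps the login.
redirected = inherit_credentials(
    "http://user%40mail.example:secret@site/a.html",
    "http://site:80/a.html",
)

# Chain A -> B -> D: the login survives the detour through B.
page_b = inherit_credentials("http://joe:pw@a.example/index.html",
                             "http://b.example/page.html")
page_d = inherit_credentials(page_b, "http://a.example/private/d.html")
```

Note that this sends the login to every host along the way, exactly as
described above; a real patch might want to limit it to the hosts that
actually need it.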

The hack around @ is not as clean, but it works in my case (it may not
work in general). I suggest that you decouple the password from the URL.
In wget, both are *always* kept together in the field called url or
something similar. This creates confusion when calling parse_url() and
similar functions. My suggestion -- take the password out of the URL at
the very beginning of a session, and keep it separate.
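
As a rough illustration of that suggestion (again in Python, not wget
code), the login could be split off once, up front. The names Target and
split_credentials() are made up for this sketch, and the URLs are
placeholders:

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit, urlunsplit


class Target(NamedTuple):
    url: str                 # URL with the login removed
    user: Optional[str]      # decoded username, or None
    password: Optional[str]  # decoded password, or None


def split_credentials(url: str) -> Target:
    """Separate user:password from the URL so they can be stored apart."""
    parts = urlsplit(url)
    # The LAST '@' separates the login from the host, so this works
    # whether the '@' in the username is written as %40 or literally.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if not at:
        return Target(url, None, None)
    user, colon, password = userinfo.partition(":")
    clean_url = urlunsplit(parts._replace(netloc=hostport))
    return Target(clean_url, unquote(user), unquote(password) if colon else None)


escaped = split_credentials("http://user%40mail.example:secret@www.site.com/main")
literal = split_credentials("http://user@mail.example:secret@www.site.com/main")
```

Both calls yield the same clean URL and the same username, so later
parsing and host comparisons never see the login or its '@' at all.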

Thanks
Dmitri

Hack Kampbjørn wrote:
> 
> Please try the latest wget version 1.6 or, even better, try the CVS
> development version (1.7-dev). Take a look at http://sunsite.dk/wget
> for instructions on how to get it.
> 
> Some work has been done on improving wget's handling of passwords,
> specifically the handling of '@' in passwords. But if not all of your
> cases have been addressed, consider submitting your patch. The web site
> also says how the wget development team prefers to receive such patches
> (diff -u against the CVS source).
> 
> Dmitri Loguinov wrote:
> >
> > Hi
> >
> > I am sure you're aware of the fact that wget 1.5.3 does not properly
> > handle passworded HTTP sites (even with Basic authentication). There are
> > several areas where the username/password are silently "dropped" in the
> > code, and wget tries to access the same site with no password.
> > Furthermore, matters were complicated because my username contained
> > the character '@'. Handling of the character was OK in retrieving the
> > first page (because it was written as %40), but upon redirection and
> > in the other cases described below, the password was dropped because
> > the code is written sloppily.
> >
> > 1. HTTP code 301 -- page permanently moved. The site I worked with
> > always redirected every page to http://site:80 and would not accept
> > http://site. Therefore, upon redirection, it's important to keep the
> > password in the code, which does not happen in wget.
> >
> > 2. The same site referenced itself with fully qualified URLs. For
> > example, instead of saying href = "main.html" it would say href =
> > "http://site/directory/main.html". Wget would lose the password in that
> > case as well. Furthermore, wget would think that the URL belongs to a
> > *different* site and would not take the link if the -L (i.e., local
> > files only) option is specified. This was apparently because the cur_url
> > contained the password, but the href did not (again, some patching was
> > needed to bypass the first @ as part of my username).
> >
> > 3. If the username contains @ (such as an email address), then after a few
> > iterations of the main code, the %40 would eventually get replaced by @
> > and upon future searches for the site name, the code would get stuck on
> > the first symbol @ instead of the second one, which separates the
> > password from the website. Consider this URL:
> > '[EMAIL PROTECTED]@www.site.com/main' -- once the %40 is expanded to the
> > first @, the code would NOT convert it back to %40 as required by one of
> > the RFCs.
> >
> > It took me about 3 hours to patch the code, but I am not sure what other
> > functionality I might have disabled or affected. To tell the truth, it
> > is quite annoying that simple things like these were not thought of by
> > whoever wrote the code. Anyhow, thanks for writing it. :)
> >
> > Dmitri
> 
> --
> Med venlig hilsen / Kind regards
> 
> Hack Kampbjørn               [EMAIL PROTECTED]
> HackLine                     +45 2031 7799
