On 5 Jul 2001, at 22:20, Jacob Burckhardt wrote:
> Ian Abbott writes:
> > On 4 Jul 2001, at 23:20, Jacob Burckhardt wrote:
> >
> > > I run wget on this file:
> > >
> > > <! ------------------------------------------------------ >
> > > <A HREF="a.html">a</a>
> > > <! ------------------------------------------------------ >
> > > <A HREF="b.html">b</a>
> > >
> > > It downloads b.html, but it does not download a.html.
> >
> > This is not HTML, nor valid SGML, so you shouldn't be too surprised
> > at the behavior.
>
> I first ran into the problem when I tried to recursively download the
> following URL:
>
> http://www.uniontrib.com/news/uniontrib/mon/opinion/inside.html
[snip]
> I tested the above URL on Netscape, lynx, and the emacs W3 web
> browser, and all of them displayed the link "Facts increasingly losing
> to fiction" which links to the file news_1e2leo.html. In Netscape, I
> clicked on the "Facts increasingly losing to fiction" link, and
> Netscape followed the link and displayed it. The other browsers also
> were able to follow the link. Even wget-1.6 followed it. But
> wget-1.7 did not.
Wget's HTML parser has been changed a few times since wget-1.6.
FWIW, Microsoft IE 5.5 also displayed both links from your simplified
example.
[snip]
> Anyway, I would appreciate any suggestions on how to use wget to
> download the above URL and make it follow to and download
> news_1e2leo.html.
Use wget-1.6 :-)
Otherwise you'd have to wait for a discussion of the "best" way to
parse this invalid HTML, a patch, and CVS updates, and you'd have to
be prepared to do a build from CVS sources or wait for a release.
I think the safest way to parse it would be to skip to the closing
'>' whenever the character after the '<!' is whitespace. Since such a
construct doesn't follow the usual rules for SGML declarations, there
is no reason for the parser to assume that any '--' sequence in it
starts or ends a comment.