On 5 Jul 2001, at 22:20, Jacob Burckhardt wrote:

> Ian Abbott writes:
> > On 4 Jul 2001, at 23:20, Jacob Burckhardt wrote:
> > 
> > > I run wget on this file:
> > > 
> > > <! ------------------------------------------------------ >
> > > <A HREF="a.html">a</a>
> > > <! ------------------------------------------------------ >
> > > <A HREF="b.html">b</a>
> > > 
> > > It downloads b.html, but it does not download a.html.
> > 
> > This is not HTML, nor valid SGML, so you shouldn't be too surprised 
> > at the behavior.
> 
> I first ran into the problem when I tried to recursively download the
> following URL:
> 
> http://www.uniontrib.com/news/uniontrib/mon/opinion/inside.html
[snip]
> I tested the above URL on Netscape, lynx, and the emacs W3 web
> browser, and all of them displayed the link "Facts increasingly losing
> to fiction" which links to the file news_1e2leo.html.  In Netscape, I
> clicked on the "Facts increasingly losing to fiction", and then
> Netscape followed the link and displayed it.  The other browsers also
> were able to follow the link.  Even wget-1.6 followed it.  But
> wget-1.7 did not.

Wget's HTML parser has been changed a few times since wget-1.6.

FWIW, Microsoft IE 5.5 also displayed both links from your simplified 
example.

[snip]
> Anyway, I would appreciate any suggestions on how to use wget to
> download the above URL and make it follow to and download
> news_1e2leo.html.

Use wget-1.6 :-)

Otherwise you'd have to wait for a discussion of the "best" way to 
parse this invalid HTML, a patch, and CVS updates, and you'd have to 
be prepared to build from CVS sources or wait for a release.

I think the safest way to parse it would be to skip to the closing 
'>' if the character after the '<!' is whitespace. Since such text 
doesn't follow the usual rules for SGML declarations, there is no 
reason the parser has to assume that any '--' sequence in it starts 
or ends a comment.
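For illustration, here is a minimal sketch of that rule in Python (not Wget's actual C parser; the function name and structure are my own). If '<!' is followed by whitespace, it jumps straight to the next '>'; otherwise it honours '--' comment delimiters as usual:

```python
def skip_declaration(html, i):
    """Return the index just past a declaration starting at html[i] == '<!'.

    Lenient rule: if the character after '<!' is whitespace, the text
    cannot be a valid SGML declaration, so skip to the next '>' and
    ignore any '--' sequences inside.  Otherwise treat '--' pairs as
    comment open/close markers in the normal way.
    """
    assert html.startswith('<!', i)
    j = i + 2
    if j < len(html) and html[j].isspace():
        # Invalid declaration such as '<! ------ >': just find the '>'.
        end = html.find('>', j)
        return len(html) if end == -1 else end + 1
    # Plausibly valid declaration: skip over '--...--' comment sections.
    while j < len(html):
        if html.startswith('--', j):
            close = html.find('--', j + 2)
            if close == -1:
                return len(html)  # unterminated comment
            j = close + 2
        elif html[j] == '>':
            return j + 1
        else:
            j += 1
    return len(html)
```

With this rule, the '<! ------ >' lines in your simplified example are skipped whole, so the '<A HREF="a.html">' that follows them is seen by the link extractor, while genuine '<!-- ... -->' comments still hide their contents.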
