Ian Abbott writes:
> On 4 Jul 2001, at 23:20, Jacob Burckhardt wrote:
>
> > I run wget on this file:
> >
> > <! ------------------------------------------------------ >
> > <A HREF="a.html">a</a>
> > <! ------------------------------------------------------ >
> > <A HREF="b.html">b</a>
> >
> > It downloads b.html, but it does not download a.html.
>
> This is not HTML, nor valid SGML, so you shouldn't be too surprised
> at the behavior.
I first ran into the problem when I tried to recursively download the
following URL:

http://www.uniontrib.com/news/uniontrib/mon/opinion/inside.html

wget-1.7 followed links in that page and downloaded the files
news_1e2hentoff.html, news_1e2lubrano.html, and others, but it did not
download news_1e2leo.html. I then tried to simplify the page by
deleting parts that seemed irrelevant to the problem; the above four
lines of (invalid) HTML are the result. In case I made a mistake while
simplifying, I thought I should also give you the original URL.
I tested the above URL in Netscape, lynx, and the Emacs W3 web
browser, and all of them displayed the link "Facts increasingly losing
to fiction", which points to news_1e2leo.html. In Netscape I clicked
on "Facts increasingly losing to fiction", and Netscape followed the
link and displayed the page. The other browsers were also able to
follow the link. Even wget-1.6 followed it, but wget-1.7 did not.
Maybe the above URL also contains invalid HTML or SGML, but if so,
three browsers and wget-1.6 still parsed it the way the author
obviously intended. If those browsers and the author of the page are
not following the official HTML standard, I realize that is bad, but
it would still be nice to be able to download this link. Perhaps wget
should follow only the official HTML standard, but then it will not
download the link, and downloading it is what matters to me. Of
course, I might be completely wrong that the problem is invalid HTML.
Anyway, I would appreciate any suggestions on how to use wget to
download the above URL so that it follows and downloads
news_1e2leo.html.
Note that the URL is for a newspaper, and the newspaper's policy is to
remove current news after a week, so this page will probably become
inaccessible around 7-8-2001. If you want to try wget on this URL,
you will need to do so before then.
Thanks.
> What wget is doing is skipping over SGML
> declarations and the comments in those declarations. One of those
> comments is started by the last two hyphens on line 1 and
> terminated by the first two hyphens on line 3, so the whole of line 2
> is commented out.
>
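To make sure I understand the rule Ian describes, I wrote a small toy
parser of my own (this is only a sketch of the two-hyphen toggle rule,
not wget's actual code; the function names are mine, and the hyphen
counts are chosen to match my four-line sample above):

```python
import re

def declaration_end(text, start):
    # 'start' points at "<!".  Within an SGML markup declaration,
    # each "--" toggles comment mode on or off; a ">" ends the
    # declaration only when we are outside a comment.
    i, in_comment = start + 2, False
    while i < len(text):
        if text.startswith("--", i):
            in_comment = not in_comment
            i += 2
        elif text[i] == ">" and not in_comment:
            return i + 1
        else:
            i += 1
    return len(text)  # unterminated declaration swallows the rest

def strip_declarations(text):
    # Return the document text with all "<! ... >" declarations
    # (including anything commented out inside them) removed.
    out, i = [], 0
    while True:
        j = text.find("<!", i)
        if j < 0:
            out.append(text[i:])
            return "".join(out)
        out.append(text[i:j])
        i = declaration_end(text, j)

dashes = "-" * 54          # 27 "--" delimiters: an odd number
doc = ('<! %s >\n<A HREF="a.html">a</a>\n'
       '<! %s >\n<A HREF="b.html">b</a>\n') % (dashes, dashes)

links = re.findall(r'HREF="([^"]+)"', strip_declarations(doc))
print(links)  # only b.html survives; a.html was inside the comment
```

Because each line has an odd number of "--" delimiters, the comment is
still open at line 1's ">", so the declaration runs on through line 2
and only closes at line 3's ">", swallowing the a.html link. With an
even number of "--" delimiters, the comment would close within its own
line and both links would survive.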
> > However, if the following file is used, then it does download a.html:
> >
> > <! ------------------------------------------------------ >
> > <A HREF="a.html">a</a>
>
> The last two hyphens on line 1 start a comment which is not
> terminated. Because it is not terminated, wget backs out and
> continues parsing the document anyway. Perhaps it shouldn't.
>