Re: image tags not read

Tony Lewis Sat, 04 Jan 2003 06:47:57 -0800

Johannes Berg wrote:

> Maybe this isn't really a bug in wget but rather in the file, but since
> this is standard as exported from MS Word I'd like to see wget recognize
> the images and download them.


Microsoft Word claims to create a valid HTML file. In fact, what it creates
can only reliably be read by Internet Explorer, and possibly only by recent
versions of it. The file that it produces contains a number of proprietary
tags as well as proprietary variations of standard HTML that only Microsoft
understands.

wget has a simple HTML parser that cannot understand these variations. While
there may be someone who is interested in patching the wget parser to deal
with Word's pseudo-HTML, I doubt that such changes would ever become part of
a standard wget release.

You might have better luck finding someone who is willing to write a program
to convert Word's pseudo-HTML into real HTML that can be read by most HTML
parsers. Since you're in an academic setting, your odds of finding someone
willing to take on this kind of project might be higher. Good luck.
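
As a rough illustration of what such a converter might do, here is a minimal
Python sketch. It rests on an assumption I have not checked against your
file: that Word hides its VML markup inside conditional comments of the form
<!--[if gte vml 1]> ... <![endif]--> and wraps the plain <img> fallback in
<![if !vml]> ... <![endif]> markers. The function name clean_word_html is
made up for this example, and none of this is part of wget.

```python
import re

# Assumption: VML-only markup sits inside "downlevel-hidden" conditional
# comments, e.g. <!--[if gte vml 1]> ... <![endif]-->.
HIDDEN_BLOCK = re.compile(
    r"<!--\[if\s[^\]]*\]>.*?<!\[endif\]-->",
    re.IGNORECASE | re.DOTALL,
)

# Assumption: the ordinary <img> fallback is wrapped in "downlevel-revealed"
# markers, e.g. <![if !vml]> ... <![endif]>, which are not standard HTML.
REVEALED_MARKER = re.compile(r"<!\[(?:if\s[^\]]*|endif)\]>", re.IGNORECASE)


def clean_word_html(text: str) -> str:
    """Drop VML-only blocks and unwrap the fallback markup around them."""
    # Remove the hidden blocks first, since they end in "<![endif]-->".
    text = HIDDEN_BLOCK.sub("", text)
    # Then delete only the marker tags, keeping whatever they enclosed.
    return REVEALED_MARKER.sub("", text)


if __name__ == "__main__":
    sample = (
        '<p><!--[if gte vml 1]><v:shape><v:imagedata '
        'src="doc_files/image001.wmz"/></v:shape><![endif]-->'
        '<![if !vml]><img src="doc_files/image002.gif"><![endif]></p>'
    )
    print(clean_word_html(sample))
```

The idea would be to run the exported file through something like this and
then point wget at the cleaned copy. Real Word output is likely to need more
than two regular expressions, so treat this as a starting point only.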

Tony
