Re: A strange bit of HTML

Hrvoje Niksic Wed, 16 Jan 2002 16:42:07 -0800

[EMAIL PROTECTED] writes:

>> Until there's an ESP package that can guess what the author
>> intended, I doubt wget has any choice but to ignore the defective
>> tag.
> 
> Seriously, I think you guys are too strict.
> Similar discussions have come up numerous times.
> If the HTML code says 
> <a href="URL" yaddayada my-Mother=Shopping%5 "going">supermarket</a>
> Why can't wget just ignore everything after ...URL"?


Because, as he said, Wget can parse text, not read minds.  For
example, you must know where a tag ends to be able to look for the
next one, or to find comments.  It is not enough to look for '>' to
determine the tag's ending -- something like <img alt="<my dog>"
src="foo"> is a perfectly legal tag.
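
To make the point about '>' concrete, here is a minimal sketch in Python.
It is not Wget's code; the function name find_tag_end is made up for this
illustration, and it only handles quoted attribute values, not comments or
declarations.

```python
def find_tag_end(html, start):
    """Return the index of the '>' that closes the tag beginning at
    html[start], or -1 if the tag is never closed.

    Quoted attribute values are skipped, so a '>' inside quotes does
    not count as the end of the tag.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None      # closing quote: back to normal scanning
        elif ch in ('"', "'"):
            quote = ch            # opening quote: ignore '>' until it closes
        elif ch == '>':
            return i
    return -1


html = '<img alt="<my dog>" src="foo"> trailing text'
naive_end = html.find('>')        # stops inside the alt value
real_end = find_tag_end(html, 0)  # stops at the real end of the tag
print(html[:naive_end + 1])
print(html[:real_end + 1])
```

The naive search cuts the tag off in the middle of the alt value, while
the quote-aware scan returns the whole tag.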

In other words, you have to destructure the tag, not only to retrieve
the URLs, but to be able to continue parsing.  If the tag is not
syntactically valid, parsing it fails and Wget moves on to other
tags.  Wget has never been able to pick apart every piece of broken
HTML.
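
A rough sketch of what "destructure the tag, or fail and move on" can look
like, again in Python and again not Wget's actual parser.  The names
parse_tag and iter_tags and the simplified attribute grammar are
assumptions made for this example; end tags and comments are simply
skipped.

```python
import re

# Tag name right after '<'.
NAME = re.compile(r'<([A-Za-z][-\w:.]*)')
# One attribute: a name, optionally followed by =value, where the value is
# double-quoted, single-quoted, or an unquoted run of non-space characters.
ATTR = re.compile(r'''\s+([A-Za-z][-\w:.]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?''')
TAIL = re.compile(r'\s*>')


def parse_tag(html, start):
    """Destructure the start tag at html[start].

    Returns (name, attrs, end), where end is the index just past '>',
    or None if the tag does not fit this sketch's grammar.
    """
    m = NAME.match(html, start)
    if not m:
        return None
    name, pos, attrs = m.group(1).lower(), m.end(), {}
    while True:
        a = ATTR.match(html, pos)
        if not a:
            break
        value = a.group(2)
        if value is not None and value[0] in '"\'':
            value = value[1:-1]   # strip the surrounding quotes
        attrs[a.group(1).lower()] = value
        pos = a.end()
    tail = TAIL.match(html, pos)
    if not tail:
        return None               # leftover junk before '>': give up on this tag
    return name, attrs, tail.end()


def iter_tags(html):
    """Yield (name, attrs) for every tag that parses; skip the rest."""
    pos = html.find('<')
    while pos != -1:
        parsed = parse_tag(html, pos)
        if parsed:
            yield parsed[0], parsed[1]
            pos = parsed[2]
        else:
            pos += 1              # malformed tag: on to other tags
        pos = html.find('<', pos)


broken = '<a href="URL" yaddayada my-Mother=Shopping%5 "going">supermarket</a>'
valid = '<img alt="<my dog>" src="foo">'
print(parse_tag(broken, 0))
print(list(iter_tags(broken + valid)))
```

With this grammar the first tag is accepted up to and including
my-Mother=Shopping%5; it is the bare "going" string, which has no
attribute name, that makes the parse fail.  The img tag after it is still
found.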

As for us being strict, I can only respond with a mini-rant...

Wget doesn't create web standards, but it tries to support them.
Spanning the chasm between the standards as written and the actual
crap generated by HTML generators feels a lot like shoveling shit.
Some amount of shoveling is necessary and is performed by all small
programs to protect their users, but there has to be a point where you
draw the line.  There is only so much shit Wget can shovel.

I'm not saying Ian's example is where the line has to be drawn.  (Your
example is equivalent to Ian's -- Wget would only choke on the last
"going" part.)  But I'm sure that the line exists and that it is not
far from those two examples.
