Re: Some html wierdness

Ian Abbott Mon, 15 Oct 2001 05:47:32 -0700

On 14 Oct 2001, at 20:06, [EMAIL PROTECTED] wrote:

> Anyway, in her index file 3 urls have a CR in them for the file name,
> which causes the website to fail to send back the file to wget because
> wget is sending the string unfiltered.
> For example:
> 
> <a href="HW
> 02_Sol.html">Solution to HW02
>       </a>
> 
> Strange formatting, but retrieval fails because the filename is being
> passed literally back to the website.
[snip]
> [...] Is \n allowed in link_uri????


I'm not entirely sure what browsers are supposed to do with 
embedded raw whitespace in URLs within HTML. Both initial and 
trailing raw whitespace should be stripped (and wget gets that 
right too), but the "correct" handling of embedded raw whitespace 
seems to be a bit of a grey area. The relevant RFCs 1738 and 2396 
discuss using whitespace to break up URIs, but limit themselves to 
discussing plain text and printed forms of URIs, rather than live 
URIs used within elements of HTML.

Some browsers ignore raw whitespace in URLs entirely (e.g. 
Netscape, lynx, links). Some others (e.g. MS Internet Explorer, 
Opera, Konqueror) will ignore all raw whitespace _except_ for 
embedded space characters, which they treat the same as %20.

For compatibility, I guess wget should either follow the Microsoft 
scheme of ignoring all embedded whitespace in URLs except for the 
space character, or follow the Netscape scheme of ignoring all 
embedded whitespace in URLs.
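
As a rough illustration of the difference between the two schemes, 
here is a minimal Python sketch. It is not wget code (wget is written 
in C), and the names HTML_WHITESPACE, clean_url_netscape and 
clean_url_msie are made up for this example. It also assumes that 
"raw whitespace" means space, tab, LF, CR and form feed.

```python
# Hypothetical helpers for illustration only; none of this is wget's API.
# Assumption: "raw whitespace" means space, tab, LF, CR and form feed.
HTML_WHITESPACE = " \t\n\r\f"


def clean_url_netscape(raw: str) -> str:
    """Netscape-style: drop every raw whitespace character in the URL."""
    return "".join(ch for ch in raw if ch not in HTML_WHITESPACE)


def clean_url_msie(raw: str) -> str:
    """MSIE-style: strip leading/trailing whitespace, drop embedded tabs
    and line breaks, and treat embedded spaces the same as %20."""
    trimmed = raw.strip(HTML_WHITESPACE)
    out = []
    for ch in trimmed:
        if ch == " ":
            out.append("%20")   # embedded space is kept, but encoded
        elif ch in HTML_WHITESPACE:
            continue            # embedded tab/CR/LF/FF is ignored
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # The href from the report: a line break in the middle of the file name.
    href = "HW\n02_Sol.html"
    print(clean_url_netscape(href))
    print(clean_url_msie(href))
```

For the href quoted above, both functions give the same result, 
since the only embedded whitespace is a line break. They differ only 
when the URL contains an embedded space, e.g. "my file.html".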
