On 14 Oct 2001, at 20:06, [EMAIL PROTECTED] wrote:

> Anyway, in her index file 3 urls have a CR in them for the file name,
> which causes the website to fail to send back the file to wget because
> wget is sending the string unfiltered.
> For example:
>
> <a href="HW
> 02_Sol.html">Solution to HW02
> </a>
>
> Strange formatting, but retrieval fails because the filename is being
> passed literally back to the website.

[snip]

> [...] Is \n allowed in link_uri????
I'm not entirely sure what browsers are supposed to do with embedded raw whitespace in URLs within HTML. Both leading and trailing raw whitespace should be stripped (and wget gets that right too), but the "correct" handling of embedded raw whitespace seems to be a bit of a grey area.

The relevant RFCs, 1738 and 2396, discuss using whitespace to break up URIs, but they limit themselves to plain text and printed forms of URIs, rather than live URIs used within elements of HTML.

Some browsers ignore raw whitespace in URLs entirely (e.g. Netscape, lynx, links). Others (e.g. MS Internet Explorer, Opera, Konqueror) ignore all raw whitespace _except_ for embedded space characters, which they treat the same as %20.

For compatibility, I guess wget should either follow the Microsoft scheme of ignoring all embedded whitespace in URLs except for the space character, or follow the Netscape scheme of ignoring all embedded whitespace in URLs.
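To make the difference between the two schemes concrete, here is a rough Python sketch. It is not wget's actual code (wget is written in C), and the function names and the exact set of characters counted as whitespace are my own assumptions for illustration only.

```python
# Illustrative sketch only: these helpers are hypothetical and are not
# part of wget. "Whitespace" is assumed to mean space, tab, LF, CR,
# vertical tab and form feed.
NON_SPACE_WS = "\t\n\r\v\f"
ALL_WS = " " + NON_SPACE_WS


def clean_url_netscape(raw: str) -> str:
    """Netscape-style: drop every raw whitespace character, spaces included."""
    return "".join(ch for ch in raw if ch not in ALL_WS)


def clean_url_microsoft(raw: str) -> str:
    """Microsoft-style: drop raw whitespace, but keep embedded spaces as %20."""
    # Leading and trailing whitespace is stripped under both schemes.
    trimmed = raw.strip(ALL_WS)
    # Remove embedded tabs, CRs, LFs, etc., leaving embedded spaces alone.
    kept = "".join(ch for ch in trimmed if ch not in NON_SPACE_WS)
    # Treat each remaining embedded space the same as %20.
    return kept.replace(" ", "%20")


if __name__ == "__main__":
    href = "HW\r\n02_Sol.html"  # the link from the report, assuming a CR LF break
    print(clean_url_netscape(href))
    print(clean_url_microsoft(href))
```

For the href in the report, both functions give HW02_Sol.html, so either scheme would fix that case. They only differ when a URL contains an embedded space: "my file.html" becomes myfile.html under the Netscape scheme and my%20file.html under the Microsoft scheme.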
