* André-John Mas wrote:
>I have tried searching for documentation on URLs and double-byte
>characters, even searched this mailing-list, but could find
>nothing concrete.

  http://www.w3.org/International/O-URL-and-ident.html

>For me the issue has arisen because I am writing a servlet that
>allows for the browsing of a virtual directory structure that in
>certain cases has entries that have Chinese names.
>
>I have looked for some algorithms, but while they worked in the
>majority of cases, they failed in a few special cases:
>
> - %20%3A%22
>   -- is this a space followed by one double-byte character, or
>      two single-byte characters?
>
> - %3A%20%22
>   -- single-byte character, space, single-byte character OR
>      double-byte character, single-byte character OR
>      single-byte character, double-byte character?

The TAG seems to agree that only the server knows what %xx-escaped
octets represent; see their recent minutes at http:[EMAIL PROTECTED]
If that is true, there are some errors in RFC 2396 that give a
different impression; see http:[EMAIL PROTECTED]

>Using Mozilla I find that it encodes UTF-8 URLs with a mixture
>of single-byte and double-byte characters. For example, a space will
>be represented as %20, any reserved ASCII character will use a
>single-byte %xx value, but anything in Chinese will be defined
>using a double-byte %xx%yy value. This makes it very difficult
>to parse a URL. I would say that the problem is with Mozilla,
>but for me the real problem is the lack of any documentation
>on the issue.

URIs with non-ASCII characters are invalid, so you are responsible
for %xx-escaping your URI references properly. If you do this, no
user agent I am aware of will touch the URI, and the server can deal
with it as it likes. How to recover from invalid URIs is undefined.
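
To make the point about octets concrete, here is a minimal sketch in
Python. It assumes the server has chosen UTF-8 for its paths, which is
only one possible choice; the helper names escape_utf8 and decode_octets
are made up for this illustration. Each %xx escape is first turned back
into one octet, and only then is the octet sequence interpreted under a
charset. Under UTF-8, an octet below 0x80 always stands for a single
ASCII character, so %20 can never be part of a multi-octet character.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_utf8(text: str) -> str:
    """Percent-escape text octet by octet, using its UTF-8 encoding.

    Hypothetical helper: with safe="" everything except unreserved
    ASCII characters (letters, digits, "-", ".", "_", "~") is escaped.
    """
    return quote(text, safe="", encoding="utf-8")


def decode_octets(escaped: str, charset: str = "utf-8") -> str:
    """Undo the %xx escapes to get raw octets, then decode them.

    Hypothetical helper: the charset is an assumption made by the
    caller; the escaped string itself does not say which one applies.
    """
    raw = unquote_to_bytes(escaped)
    return raw.decode(charset)


if __name__ == "__main__":
    # The two Chinese characters in this sample each take three
    # octets in UTF-8.
    print(escape_utf8("中 文"))

    # The examples from the question contain only octets below 0x80,
    # so under UTF-8 they are three single-octet characters each.
    print(repr(decode_octets("%20%3A%22")))
    print(repr(decode_octets("%3A%20%22")))

    # The same octets read under another charset give different text,
    # which is why the server has to know what it issued.
    print(repr(decode_octets("%E4%B8%AD", "utf-8")))
    print(repr(decode_octets("%E4%B8%AD", "latin-1")))
```

If the server uses some other multi-byte encoding instead, the same
two steps apply, but the second argument to decode_octets has to name
that encoding, and the boundaries between characters follow its rules
rather than those of UTF-8.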
