* André-John Mas wrote:
>I have tried searching for documentation on URLs and double-byte
>characters, even searched this mailing-list, but could find
>nothing concrete.

http://www.w3.org/International/O-URL-and-ident.html

>For me the issue has arisen because I am writing a servlet that
>allows browsing a virtual directory structure that in
>certain cases has entries with Chinese names.
>
>I have tried some algorithms, but while they worked in the
>majority of cases, they failed in a few special cases:
>
>   - %20%3A%22
>     -- is this a space followed by one double byte character, or
>     two single byte characters?
>
>   - %3A%20%22
>     -- single byte character, space, single byte character OR
>     double byte character, single byte character OR single
>     byte character, double byte character?

The TAG seems to agree that only the server knows what %xx escaped
octets represent; see their recent minutes at

  http:[EMAIL PROTECTED]

If that's true, there are some errors in RFC 2396 that give a different
impression; see

  http:[EMAIL PROTECTED]
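
To illustrate why the escaped octets alone do not settle the question, here
is a minimal Python sketch (not taken from any specification). The helper
name decode_path and the sample strings are made up for illustration. It
undoes the %xx escaping to get raw octets, then decodes the same octets
under two different charsets:

```python
from urllib.parse import unquote_to_bytes


def decode_path(escaped: str, charset: str) -> str:
    """Undo %xx escaping to raw octets, then decode them with `charset`.

    Hypothetical helper: the caller (the server) has to supply the
    charset, because the escaped form does not carry it.
    """
    octets = unquote_to_bytes(escaped)
    return octets.decode(charset)


# The first example from the question: three octets, all below 0x80.
ascii_octets = unquote_to_bytes("%20%3A%22")

# The same three escaped octets read under two different assumptions.
as_utf8 = decode_path("%E4%B8%AD", "utf-8")      # one character
as_latin1 = decode_path("%E4%B8%AD", "latin-1")  # three characters

print(ascii_octets, len(as_utf8), len(as_latin1))
```

Under any ASCII-compatible charset, %20%3A%22 is three separate characters
(space, colon, double quote), since none of the octets has the high bit set.
The second string shows the real ambiguity: the octets are identical, and
only the charset chosen by the server decides how many characters they form.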

>Using Mozilla I find that it encodes UTF-8 URLs with a mixture
>of single-byte and double-byte characters. For example, a space will
>be represented as %20, any reserved ASCII character will use a
>single-byte %xx value, but anything in Chinese will be encoded
>using a double-byte %xx%yy value. This makes it very difficult
>to parse a URL. I would say that the problem is with Mozilla,
>but for me the real problem is the lack of any documentation
>on the issue.

URIs with non-ASCII characters are invalid, so you are responsible for
%xx-escaping your URI references properly. If you do this, no user agent I
am aware of will touch the URI, and the server can deal with it as it
likes. How to recover from invalid URIs is undefined.
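
As a rough sketch of what escaping properly could look like, assuming the
server has settled on UTF-8 for its paths (that choice is an assumption
here, not something the URI syntax mandates), the following Python uses
urllib.parse.quote. The helper name escape_path and the sample path are
made up for illustration:

```python
from urllib.parse import quote, unquote


def escape_path(path: str) -> str:
    """Percent-escape a path so the resulting URI reference is pure ASCII.

    Assumes the server decodes paths as UTF-8. "/" is kept unescaped so
    the directory structure stays visible.
    """
    return quote(path, safe="/", encoding="utf-8")


escaped = escape_path("/docs/中文 file.txt")
print(escaped)

# The server reverses the process using the same charset.
restored = unquote(escaped, encoding="utf-8")
print(restored)
```

Each of the two Chinese characters in this sample becomes three %xx octets
in UTF-8, while the space becomes the single octet %20, so the server has
to decode the whole octet sequence with a known charset rather than guess
character boundaries from the escapes.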
