Re: URLs and double byte characters (unicode)

xuefer tinys Sun, 22 Dec 2002 19:00:24 -0800

No, that is exactly how UTF-8 works.
I guess that when you refer to Unicode, you mean UTF-16.
In UTF-8, every ASCII character still takes 1 byte.
You can still urldecode the URL into UTF-8 bytes with the old function, and then convert the UTF-8 to UTF-16, which is what you need.
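
As a rough illustration of the two steps above (urldecode, then treat the bytes as UTF-8), here is a minimal Java sketch. The class name UrlUtf8Decode and the sample input are made up for illustration; it assumes Java 10 or later for the URLDecoder.decode(String, Charset) overload. Java strings are UTF-16 internally, so the UTF-8 to UTF-16 conversion happens as part of the decode.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class UrlUtf8Decode {
    // Percent-decode a URL component and interpret the resulting bytes as UTF-8.
    // The returned String is held as UTF-16 by the JVM.
    static String decode(String escaped) {
        return URLDecoder.decode(escaped, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three ASCII escapes followed by one three-byte UTF-8 sequence.
        String s = decode("%20%3A%22%E4%B8%AD");
        System.out.println(s.length()); // number of UTF-16 code units
    }
}
```

Note that URLDecoder follows the HTML form rules, so it also turns "+" into a space; that may or may not be what you want for path segments.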

From: André-John Mas <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: URLs and double byte characters (unicode)
Date: Sun, 22 Dec 2002 10:12:05 -0500



Hi,

I have tried searching for documentation on URLs and double-byte
characters, even searched this mailing-list, but could find
nothing concrete.

For me the issue has arisen because I am writing a servlet that
allows for the browsing of a virtual directory structure that in
certain cases has entries with Chinese names.

I have looked for some algorithms, but while they worked in the
majority of cases, they failed in a few special cases:

  - %20%3A%22
    -- is this a space followed by one double byte character, or
    two single byte characters?

  - %3A%20%22
    -- single byte character, space, single byte character OR
    double byte character, single byte character OR single
    byte character, double byte character?
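
If the escapes are UTF-8, the cases above are not ambiguous, because the first byte of each character says how many bytes the character occupies. A minimal sketch of that rule follows; the class and method names are hypothetical, and the check is simplified (it does not reject overlong forms such as a 0xC0 lead byte).

```java
// Hypothetical helper class, for illustration only.
class Utf8LeadByte {
    // Length in bytes of the UTF-8 sequence starting with this byte,
    // or 0 if the byte does not match any lead-byte pattern
    // (for example a 10xxxxxx continuation byte).
    static int sequenceLength(int b) {
        b &= 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx: plain ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;
    }
}
```

Under this rule 0x20, 0x3A and 0x22 are all below 0x80, so both %20%3A%22 and %3A%20%22 are three single-byte characters (space, colon and double quote, in different orders).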

Using Mozilla, I find that it encodes UTF-8 URLs with a mixture
of single-byte and double-byte characters. For example, a space
will be represented as %20, any reserved ASCII character will use
a single byte %xx value, but anything in Chinese will be defined
using a double byte %xx%yy value. This makes it very difficult
to parse a URL. I would say that the problem is with Mozilla,
but for me the real problem is the lack of any documentation
on the issue. An RFC would be nice, so that at least I know I am
dealing with the same solution in all modern web browsers.
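
The mixture of lengths described above can be reproduced with a small Java sketch, assuming the browser percent-encodes the UTF-8 bytes of each character. The class name and the sample string are made up; URLEncoder.encode(String, Charset) needs Java 10 or later and uses the HTML form rules, so it writes a space as "+" rather than %20.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class UrlUtf8Encode {
    // Percent-encode the UTF-8 bytes of the given text (form-style encoding).
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a' stays as is, space becomes '+', U+4E2D becomes three escapes,
        // and ':' becomes a single escape.
        System.out.println(encode("a \u4E2D:")); // prints a+%E4%B8%AD%3A
    }
}
```

In this example the Chinese character takes three escapes rather than two, and each ASCII character takes at most one.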

regards

Andre


