Rectification to the clarification: what I say below about UTF-16 always being 16-bit and limited is also nonsense. UTF-16 is variable-length and can cover the entire Unicode character set. It just uses one or two 16-bit words per character, as compared to UTF-8, which uses a variable number of 8-bit bytes.
I should have checked my sources. Shame on me.

About Java's internal char type being 16-bit wide though, I have heard that too, and I'm also curious.
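
On the Java question, a small sketch may help. It assumes a Java 5 or later runtime; the class name CharWidthDemo and the helper unitsFor are made up for illustration. As far as I can tell, a Java char is one 16-bit UTF-16 unit, so a character beyond the 16-bit range occupies two chars (a surrogate pair) inside a String:

```java
// Sketch: a Java char is a 16-bit UTF-16 code unit, not a full Unicode character.
class CharWidthDemo {
    // Number of 16-bit char units needed to hold one code point (1 or 2).
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String a = "A";                                        // U+0041, fits in 16 bits
        String clef = new String(Character.toChars(0x1D11E));  // U+1D11E, beyond 16 bits

        System.out.println(Character.SIZE);                         // bits in one char
        System.out.println(a.length());                             // char units used by "A"
        System.out.println(clef.length());                          // char units used by the clef
        System.out.println(clef.codePointCount(0, clef.length()));  // code points in the clef string
    }
}
```

So String.length() counts 16-bit units, not characters; codePointCount is the call to use when the difference matters.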

André Warnier wrote:
Caldarale, Charles R wrote:
From: Christopher Schultz [mailto:[EMAIL PROTECTED]
Subject: Re: Migrating to tomcat 6 gives formatted currency
amounts problem

(My understanding is that Unicode (16-bit) is actually not
big enough for everything, but hey, they tried).

Point of clarification: Unicode is NOT limited to 16 bits (not even in Java, these days). There are defined code points that use 32 bits, and I don't think there's a limit, if you use the defined extension mechanisms. Again, browsing the Unicode web site is extremely enlightening.

Further clarification :
Unicode is not tied to any number of bits. Unicode is (aims to be) a list which assigns a number to every distinct character known to man, starting from 0. (As the standard is currently defined, the list stops at \x10FFFF, which leaves room for about 1.1 million characters, far more than are in use.) The particular position number given to a particular character in this Unicode list is known as its "Unicode codepoint". The Unicode Consortium also tries to do this with some order, such as trying to keep together (with consecutive codepoints) various groups of characters that are logically related in some way. For example (but probably because they had to start somewhere), the first 128 codepoints match the original 7-bit US-ASCII alphabet; so for instance the "capital letter A", which has code \x41 in US-ASCII, happens to have Unicode codepoint \x0041 (both 65 in decimal terms). Likewise, the same first 128 codepoints, plus the next 128 codepoints, match the iso-8859-1 alphabet (also known as iso-latin-1); thus the character known as "capital letter A with umlaut" (an A with a double-dot on top) has the codepoint \x00C4 in Unicode, and the code \xC4 in iso-8859-1 (both 196 in decimal).
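
To make the two examples concrete, here is a minimal Java sketch (assuming Java 7 or later for StandardCharsets; the class name CodePointDemo and the helper isoLatin1Code are made up for illustration). It prints the Unicode codepoint of each character next to its single-byte code in iso-8859-1:

```java
import java.nio.charset.StandardCharsets;

// Sketch: compare a character's Unicode codepoint with its iso-8859-1 code.
class CodePointDemo {
    // Code of a single character in iso-8859-1, as an unsigned value 0..255.
    static int isoLatin1Code(String oneChar) {
        return oneChar.getBytes(StandardCharsets.ISO_8859_1)[0] & 0xFF;
    }

    public static void main(String[] args) {
        String capitalA = "A";        // capital letter A
        String aUmlaut = "\u00C4";    // capital letter A with umlaut

        // Unicode codepoint, then iso-8859-1 code, for each character.
        System.out.println(capitalA.codePointAt(0) + " " + isoLatin1Code(capitalA)); // 65 65
        System.out.println(aUmlaut.codePointAt(0) + " " + isoLatin1Code(aUmlaut));   // 196 196
    }
}
```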

New Unicode characters (and codepoints) are being added all the time (Klingon was even proposed at one point, though as far as I know it was turned down), but there are also holes in the list (presumably left for whenever some forgotten related character shows up).

A quite different issue is encoding.

Because it would be quite impractical to specify a series of characters just by writing their codepoints one after the other (using whatever number of bits each codepoint needs), a series of clever schemes have been devised in order to pass Unicode strings around, while still being able to separate them into characters and keep each one with its proper codepoint. Such schemes are known as "Unicode encodings", with names such as UTF-2, UTF-7, UTF-8, UTF-16, UTF-32, etc. Each one of them specifies an algorithm whereby one can take any Unicode character (or rather, its codepoint) and "encode" it into a series of bits, in such a way that at the receiving end, the opposite algorithm can be used to "decode" that series of bits and retrieve once again the same series of Unicode codepoints (or characters).
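
The encode/decode round trip can be sketched in Java as follows (assuming Java 7 or later; the class EncodingRoundTrip and its helpers are made up for illustration). The same two-character string becomes different bytes under UTF-8 and UTF-16BE, and each decoder recovers the original text from its own bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: the same codepoints, two encodings, and a lossless round trip for each.
class EncodingRoundTrip {
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Render bytes as space-separated upper-case hex, e.g. "41 C3 84".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "A\u00C4"; // capital A, then capital A with umlaut
        Charset[] encodings = { StandardCharsets.UTF_8, StandardCharsets.UTF_16BE };
        for (Charset cs : encodings) {
            byte[] wire = encode(text, cs);
            String back = decode(wire, cs);
            System.out.println(cs + ": " + hex(wire) + " round trip ok: " + back.equals(text));
        }
    }
}
```

The important point is that sender and receiver must use the same charset; the bytes alone do not say which one was used.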

UTF-16, for example, is an encoding of Unicode which always uses 16 bits for each Unicode codepoint; but it is to my knowledge incomplete, because since it uses a fixed 16 bits per character, it can only ever represent the first 65,536 Unicode characters. (But we're not there yet, and there is still some leeway).

UTF-8 on the other hand is a variable-length scheme, using 1, 2, 3, or 4 8-bit groups to represent each Unicode codepoint. That is enough for the whole list as currently defined, and the original design allowed even longer sequences, so there is room to extend it whenever the need arises (imagine that some aliens suddenly show up, and that they happen to write in 167 different languages and alphabets).
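
A quick way to see the variable length is to count the bytes for a few codepoints. This is only a sketch (assuming Java 7 or later; the class Utf8LengthDemo and the helper utf8Length are made up for illustration), and the four sample codepoints are my own choice:

```java
import java.nio.charset.StandardCharsets;

// Sketch: how many bytes UTF-8 needs for codepoints of increasing size.
class Utf8LengthDemo {
    // Number of bytes in the UTF-8 encoding of a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0041,   // capital letter A
            0x00C4,   // capital letter A with umlaut
            0x20AC,   // euro sign
            0x1D11E   // musical symbol G clef, beyond the 16-bit range
        };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %d byte(s)", cp, utf8Length(cp)));
        }
    }
}
```

With these samples the lengths come out as 1, 2, 3 and 4 bytes respectively.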

One frequent misconception is that in UTF-8, the first 256 "character encoding bit sequences" match the iso-8859-1 codepoints. In fact, only the first 128 characters of iso-8859-1 (which happen to match the 128 characters of US-ASCII and the first 128 Unicode codepoints) have a single-byte representation in UTF-8 which matches their Unicode codepoint. The next 128 iso-8859-1 characters (which include the capital A with umlaut) require 2 bytes each in the UTF-8 encoding. Thus for instance, the "capital letter A with umlaut" has the Unicode codepoint \x00C4 (196 decimal), because it is the 197th character in the Unicode list (and the first one is \x0000). It also happens to have the code \xC4 (196 decimal) in the iso-8859-1 table. But in UTF-8, it is encoded as the two bytes \xC3\x84, which is not the decimal number 196 in any way.
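
The capital A with umlaut case can be checked with a few lines of Java (a sketch, assuming Java 7 or later; the class AUmlautBytes and the helper hex are made up for illustration). It also shows what a receiver sees if it reads the UTF-8 bytes as if they were iso-8859-1:

```java
import java.nio.charset.StandardCharsets;

// Sketch: the same character as bytes in iso-8859-1 and in UTF-8.
class AUmlautBytes {
    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00C4";

        byte[] latin1 = aUmlaut.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = aUmlaut.getBytes(StandardCharsets.UTF_8);
        System.out.println("iso-8859-1: " + hex(latin1)); // C4
        System.out.println("UTF-8:      " + hex(utf8));   // C3 84

        // Reading the UTF-8 bytes with the wrong charset yields two characters,
        // \x00C3 and \x0084, instead of the single \x00C4.
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length() + " characters, equal to original: "
                + misread.equals(aUmlaut));
    }
}
```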


All of that to say that when some people on this list say things like "you should always decode your URLs as if they were Unicode (or UTF-8), because it is the same as ASCII or iso-latin-1 anyway", they are talking nonsense. The only time you can do that is when the server and all the clients have agreed in advance that this is how they were going to encode and decode URLs. (That we developers wish it were so, and that ultimately we may get there, is another matter.)
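
Here is a hedged sketch of the URL case using the standard java.net.URLEncoder and java.net.URLDecoder classes (the class UrlCharsetDemo and its two wrapper methods are made up for illustration; this is not how Tomcat itself decodes request URLs). It shows that a percent-encoded capital A with umlaut only survives when both sides use the same charset:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Sketch: percent-encoding depends on the charset both sides agreed on.
class UrlCharsetDemo {
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00C4";
        String sentAsUtf8 = encode(aUmlaut, "UTF-8");        // %C3%84
        String sentAsLatin1 = encode(aUmlaut, "ISO-8859-1"); // %C4
        System.out.println(sentAsUtf8 + " " + sentAsLatin1);

        // Matching charsets on both sides: the character comes back intact.
        System.out.println(decode(sentAsUtf8, "UTF-8").equals(aUmlaut));        // true
        System.out.println(decode(sentAsLatin1, "ISO-8859-1").equals(aUmlaut)); // true

        // Mismatched charsets: the character does not come back.
        System.out.println(decode(sentAsUtf8, "ISO-8859-1").equals(aUmlaut));   // false
        System.out.println(decode(sentAsLatin1, "UTF-8").equals(aUmlaut));      // false
    }
}
```

Plain ASCII characters would pass all four checks, which is probably why the "it is the same anyway" idea keeps coming back.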

It is also talking nonsense to say that you should by default consider html pages as UTF-8 encoded. The default character set (and encoding, because in that case both are the same) for html is iso-8859-1, and anything else (including UTF-8 or UTF-16) is non-default.
(see http://www.ietf.org/rfc/rfc2854.txt, section 6).
(So if you do output something else, you *must* say so).
(And hope that IE doesn't second-guess you).

We probably owe that to Tim Berners-Lee, and with all due respect and admiration for the guy, it may be an unfortunate historical accident that he was born in England and worked in Switzerland (both countries quite happy with iso-8859-1), rather than being, say, a Chinese national working in Greece, who might have preferred Unicode and UTF-8. But hey, he invented it, so he got to choose.

Anyway for the time being we all have to live with it.
Even the Tomcat guys.


---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


