Re: Migrating to tomcat 6 gives formatted currency amounts problem

André Warnier Fri, 12 Sep 2008 13:15:29 -0700

Rectification to the clarification : what I say below about UTF-16 being always 16-bit and limited is also nonsense. UTF-16 is variable-length, it can cover the entire Unicode character set. It just uses a variable number of 16-bit words per character, as compared to UTF-8 which uses a variable number of 8-bit bytes.
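
To make the variable-length point concrete, here is a minimal sketch in Java. It assumes Java 7 or later (for StandardCharsets); the class name, the method names and the four sample codepoints are my own choices for illustration, not anything from the thread.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    /** Number of 8-bit bytes the codepoint occupies in UTF-8. */
    public static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of 16-bit words the codepoint occupies in UTF-16. */
    public static int utf16Words(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        // UTF_16BE writes no byte-order mark, so length / 2 is the word count.
        return s.getBytes(StandardCharsets.UTF_16BE).length / 2;
    }

    public static void main(String[] args) {
        // Illustrative samples: A, A with umlaut, euro sign, musical G clef.
        int[] samples = {0x41, 0xC4, 0x20AC, 0x1D11E};
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %d byte(s)  UTF-16: %d word(s)%n",
                    cp, utf8Bytes(cp), utf16Words(cp));
        }
    }
}
```

The first three samples need one 16-bit word each in UTF-16 and 1, 2 and 3 bytes in UTF-8; the last one (U+1D11E) needs two 16-bit words and 4 bytes.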

I should have checked my sources. Shame on me.

About Java's internal char type being 16-bit wide though, I have heard that too, and I'm also curious.
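
For the curious, a small sketch that can be run to see what a Java char holds. It only uses java.lang methods available since Java 5; the class name, the helper fromCodePoint and the sample codepoint U+1D11E are assumptions made for illustration.

```java
public class JavaCharWidth {
    /** Builds a String holding exactly one Unicode codepoint. */
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+1D11E lies above U+FFFF, so it cannot fit in a single 16-bit char.
        String clef = fromCodePoint(0x1D11E);

        System.out.println("bits per char:  " + Character.SIZE);
        System.out.println("length():       " + clef.length());
        System.out.println("codePointCount: " + clef.codePointCount(0, clef.length()));
        System.out.println("high surrogate: " + Character.isHighSurrogate(clef.charAt(0)));
        System.out.printf("codePointAt(0): U+%X%n", clef.codePointAt(0));
    }
}
```

In other words, a char is a 16-bit UTF-16 word, and String.length() counts those words rather than characters; a codepoint above U+FFFF is stored as a pair of chars.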


André Warnier wrote:

Caldarale, Charles R wrote:
From: Christopher Schultz [mailto:[EMAIL PROTECTED]
Subject: Re: Migrating to tomcat 6 gives formatted currency
amounts problem

(My understanding is that Unicode (16-bit) is actually not
big enough for everything, but hey, they tried).
Point of clarification: Unicode is NOT limited to 16 bits (not even in Java, these days). There are defined code points that use 32 bits, and I don't think there's a limit, if you use the defined extension mechanisms. Again, browsing the Unicode web site is extremely enlightening.
Further clarification :
Unicode is not limited to anything. Unicode is (aims to be) a list which attributes to any distinct character known to man, a number, from 0 to infinity. The particular position number given to a particular character in this Unicode list is known as its "Unicode codepoint". The Unicode group (consortium ?) also tries to do this with some order, such as trying to keep together (with consecutive codepoints) various groups of characters that are logically related in some way. For example (but probably because they had to start somewhere), the first 128 codepoints match the original 7-bit US-ASCII alphabet; so for instance the "capital letter A", which has code \x41 in US-ASCII, happens to have Unicode codepoint \x0041 (both 65 in decimal terms). For example also, the same first 128 codepoints, plus the next 128 codepoints, match the iso-8859-1 alphabet (also known as iso-latin-1); thus the character known as "capital letter A with umlaut" (an A with a double-dot on top) has the codepoint \x00C4 in Unicode, and the code \xC4 in iso-8859-1 (both 196 in decimal).
New Unicode characters (and codepoints) are being added all the time (I think there's even Klingon in there), but there are also holes in the list (presumably left for whenever some forgotten related character shows up).
A quite different issue is encoding.
Because it would be quite impractical to specify a series of characters just by writing their codepoints one after the other (using whatever number of bits each codepoint needs), a series of clever schemes have been devised in order to pass Unicode strings around, while being able to separate them into characters, and keep each one with its proper codepoint. Such schemes are known as "Unicode encodings" with names such as UTF-2, UTF-7, UTF-8, UTF-16, UTF-32, etc. Each one of them specifies an algorithm whereby one can take any Unicode character (or rather, its codepoint), and "encode" it into a series of bits, in such a way that at the receiving end, an opposite algorithm can be used to "decode" that series of bits and retrieve once again the same series of Unicode codepoints (or characters).
UTF-16, for example, is an encoding of Unicode which always uses 16 bits for each Unicode codepoint; but it is to my knowledge incomplete, because since it uses a fixed number of 16 bits per character, it can thus only ever represent no more than the first 65,536 Unicode characters. (But we're not there yet, and there is still some leeway).
UTF-8 on the other hand is a variable-length scheme, using 1, 2, 3, or more 8-bit groups to represent each Unicode codepoint. And it is in principle not limited, as there are extension mechanisms foreseen for whenever the need arises (imagine that some aliens suddenly show up, and that they happen to write in 167 different languages and alphabets).
One frequent misconception is that in UTF-8, the first 256 "character encoding bit sequences" match the iso-8859-1 codepoints. Only the first 128 characters of iso-8859-1 (which happen to match the 128 characters of US-ASCII and the first 128 Unicode codepoints), have a single-byte representation in UTF-8 which happens to match their Unicode codepoint. The next 128 iso-8859-1 characters (which contain the capital A with umlaut) require 2 bytes each in the UTF-8 encoding. Thus for instance, the "capital letter A with umlaut" has the Unicode codepoint \x00C4 (196 decimal), because it is the 197th character in the Unicode list (and the first one is \x0000). It also happens to have the code \xC4 (196 decimal) in the iso-8859-1 table. But in UTF-8, it is encoded as the two bytes \xC3\x84, which is not the decimal number 196 in any way.
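
The capital A with umlaut example can be checked with a few lines of Java. This is only a sketch, assuming Java 7 or later for StandardCharsets; the class name and the hex helper are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

public class AUmlautBytes {
    /** Renders a byte array as upper-case hex, two digits per byte. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as an escape so the result does not depend on the
        // encoding of the source file itself.
        String aUmlaut = "\u00C4";

        System.out.println(aUmlaut.codePointAt(0));                             // 196
        System.out.println(hex(aUmlaut.getBytes(StandardCharsets.ISO_8859_1))); // C4
        System.out.println(hex(aUmlaut.getBytes(StandardCharsets.UTF_8)));      // C384
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));          // 41
    }
}
```

The single byte \xC4 only shows up in the iso-8859-1 output; the plain "A" is the same single byte \x41 in both encodings.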
All of that to say that when some people on this list say things like "you should always decode your URLs as if they were Unicode (or UTF-8), because it is the same as ASCII or iso-latin-1 anyway", they are talking nonsense. The only time you can do that is when the server and all the clients have agreed in advance that this is how they were going to encode and decode URLs. (That we developers wish it were so, and that ultimately we may get there, is another matter.)
It is also talking nonsense to say that you should by default consider html pages as UTF-8 encoded. The default character set (and encoding, because in that case both are the same) for html is iso-8859-1, and anything else (including UTF-8 or UTF-16) is non-default.
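
To illustrate why the two sides must agree, here is a hedged sketch using java.net.URLDecoder from the standard library. The class name, the decode wrapper and the sample string "%C3%84" are assumptions chosen for the example; it says nothing about how Tomcat itself decodes request URLs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecodeCharsets {
    /** Decodes a percent-escaped string using the named charset. */
    public static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are mandatory on every Java platform.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A client that encodes "A with umlaut" as UTF-8 sends %C3%84.
        String asUtf8 = decode("%C3%84", "UTF-8");
        String asLatin1 = decode("%C3%84", "ISO-8859-1");

        System.out.println(asUtf8.equals("\u00C4"));          // true
        System.out.println(asLatin1.equals("\u00C3\u0084"));  // true
        System.out.println(asUtf8.length() + " vs " + asLatin1.length()); // 1 vs 2
    }
}
```

The same two bytes come back as one character under UTF-8 and as two unrelated characters under iso-8859-1, which is exactly the garbling that shows up when client and server assume different encodings.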
(see http://www.ietf.org/rfc/rfc2854.txt, section 6).
(So if you do output something else, you *must* say so).
(And hope that IE doesn't second-guess you).
We probably owe that to Tim Berners-Lee, and with tons of respect and admiration for the guy notwithstanding, it may be an unfortunate historical accident that he was born in England and worked in Switzerland (both countries quite happy with iso-8859-1), rather than being a Chinese national working in Greece e.g., who might have preferred Unicode and UTF-8. But hey, he invented it, so he got to choose.
Anyway for the time being we all have to live with it.
Even the Tomcat guys.


---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


