No fair! You forgot to quote the disclaimer from my next email about my big boo-boo regarding what an int is in Java. An int is fine, darnit! It's char that was originally (at least externally) limited to 16 bits. Of course, many APIs use ints, which don't present a problem. But java.lang.Character and java.lang.String would have to change their internal representation, or add methods or something, to allow a surrogate pair to be evaluated as a single character.

Addison

-----Original Message-----
From: Yung-Fong Tang [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 03, 2001 4:17 PM
To: Addison Phillips [wM]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: surrogate at java's property file

Brian Beck: What do you think ?

"Addison Phillips [wM]" wrote:

> Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
> stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
> versions have defined characters in the supplemental planes.
>
> In Java, a java.lang.Character object is closely tied to the definition of
> an "int", the 16-bit numeric type. Many classes and objects make no
> distinction (or worse, conflate a character with an int---many methods are
> defined to take and return ints for "Characters"). As a result, the Java
> character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
> surrogate character *is* recognized to be a surrogate, but a high-low pair
> is not recognized as representing a character, nor can you retrieve the
> character properties of the matched pair.
>
> So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> represented by the sequence "\ud800\udc00". This sequence does NOT represent
> U+10000. It represents TWO Characters, which happen to be surrogates that
> form a valid pair. I should point out that Java is slightly clever. For
> example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
> value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
> vice versa, of course).
>
> However, it is unclear how Unicode 3.1 support is going to make it into JDK
> 1.4++. The APIs are going to have to change to support the supplemental
> planes and the ripple effects on various APIs seems like an interesting
> problem. Perhaps they'll redefine an int to be a 32-bit value and switch
> Java to UTF-32 (yeah, sure.....)
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Globalization Architect / Manager, Globalization Engineering
> webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
> +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture. It is not a feature.
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Yung-Fong Tang
> Sent: Monday, October 01, 2001 5:10 PM
> To: [EMAIL PROTECTED]
> Subject: surrogate at java's property file
>
> Any one know how does Java handle Surrogate pair property file ?
>
> Java's property file use the \u encoding for non ASCII characters,
> therefore U+00a5 is \u00A5. I wonder anyone know how does it handle
> Surrogate Pair?
>
> Does U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00" ? (I
> think it should be \u10000) or they cannot handle them at all ?
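
The property-file question in this thread can be tried out with a small sketch like the one below. It is only an illustration and not part of the original messages. It assumes a Java release much later than the JDKs discussed here: the code point methods such as String.codePointAt arrived in Java 5, Properties.load(Reader) in Java 6, and StandardCharsets in Java 7. The names SurrogateDemo, loadValue, and toHex are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; the names are not from the thread.
class SurrogateDemo {

    // Parse property-file text and return the value stored under key.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Format bytes as lowercase hex separated by dashes, as in the thread.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two four-digit escapes: a high surrogate followed by a low surrogate.
        String pair = loadValue("key=\\ud800\\udc00\n", "key");
        System.out.println(pair.length());                            // 2 chars
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 10000
        System.out.println(toHex(pair.getBytes(StandardCharsets.UTF_8))); // f0-90-80-80

        // A five-digit escape is not understood: only four hex digits are
        // consumed, so the value is U+1000 followed by the digit '0'.
        String five = loadValue("key=\\u10000\n", "key");
        System.out.println(Integer.toHexString(five.charAt(0)));      // 1000
        System.out.println(five.charAt(1));                           // 0
    }
}
```

So the value is still stored as two chars, which matches the description above, and the UTF-8 converter produces the four-byte sequence mentioned in the quoted message. Treating the pair as one code point relies on the methods added in the later releases assumed here.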

