Addison,

It might be easier to convert the JVM from UCS-2 to UTF-32 so that you do not have to worry about surrogates. This would more closely match most Unix implementations (except Sun), where Java is widely used.

Carl

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
> Behalf Of Addison Phillips [wM]
> Sent: Wednesday, October 03, 2001 4:31 PM
> To: Yung-Fong Tang
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: surrogate at java's property file
>
> No fair! You forgot to quote my disclaimer in the next email for my big
> boo-boo regarding what an int is in Java. An int is fine, darnit! It's
> char that was originally (at least externally) limited to 16-bits. Of
> course, many APIs use ints, which don't present a problem. But
> java.lang.Character and java.lang.String would have to change internal
> representation or add methods or something to allow surrogate pairs to
> be evaluated.
>
> Addison
>
> -----Original Message-----
> From: Yung-Fong Tang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 03, 2001 4:17 PM
> To: Addison Phillips [wM]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: surrogate at java's property file
>
> Brian Beck:
> What do you think ?
>
> "Addison Phillips [wM]" wrote:
>
> > Java doesn't define any characters beyond Unicode 2.1.8 at the moment.
> > It's stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither
> > of these versions have defined characters in the supplemental planes.
> >
> > In Java, a java.lang.Character object is closely tied to the definition
> > of an "int", the 16-bit numeric type. Many classes and objects make no
> > distinction (or worse, conflate a character with an int---many methods
> > are defined to take and return ints for "Characters"). As a result, the
> > Java character model appears to be tied to UCS-2 (and I don't mean
> > UTF-16). A surrogate character *is* recognized to be a surrogate, but a
> > high-low pair is not recognized as representing a character, nor can you
> > retrieve the character properties of the matched pair.
> >
> > So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> > represented by the sequence "\ud800\udc00". This sequence does NOT
> > represent U+10000. It represents TWO Characters, which happen to be
> > surrogates that form a valid pair. I should point out that Java is
> > slightly clever. For example, the UTF-8 converter knows that U+D800
> > U+DC00 represents the scalar value U+10000 and encodes it as a valid
> > four byte sequence: f0-90-80-80 (and vice versa, of course).
> >
> > However, it is unclear how Unicode 3.1 support is going to make it into
> > JDK 1.4++. The APIs are going to have to change to support the
> > supplemental planes and the ripple effects on various APIs seems like an
> > interesting problem. Perhaps they'll redefine an int to be a 32-bit
> > value and switch Java to UTF-32 (yeah, sure.....)
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Globalization Architect / Manager, Globalization Engineering
> > webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
> > +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
> > -------------------------------------------------
> > Internationalization is an architecture. It is not a feature.
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
> > Behalf Of Yung-Fong Tang
> > Sent: Monday, October 01, 2001 5:10 PM
> > To: [EMAIL PROTECTED]
> > Subject: surrogate at java's property file
> >
> > Any one know how does Java handle Surrogate pair property file ?
> >
> > Java's property file use the \u encoding for non ASCII characters,
> > therefore U+00a5 is \u00A5. I wonder anyone know how does it handle
> > Surrogate Pair?
> >
> > Does U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00" ? (I
> > think it should be \u10000) or they cannot handle them at all ?
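
For anyone who wants to check the property file behavior discussed in this thread, here is a rough sketch. It assumes a much later JDK than the ones mentioned above (Java 7 or newer, for StandardCharsets and for the code point methods that were added to String in J2SE 5.0). The class name SurrogateDemo, the helpers loadValue and utf8Hex, and the property name "key" are made up for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class SurrogateDemo {

    // Hypothetical helper: parse property-file text and return the value of "key".
    static String loadValue(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("key");
    }

    // Hypothetical helper: UTF-8 bytes of s as lowercase hex, separated by '-'.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The file text contains the escapes \ud800\udc00 (backslashes doubled
        // here so that javac leaves them for Properties.load to interpret).
        String pair = loadValue("key=\\ud800\\udc00\n");

        // Two 16-bit chars, as described in the thread.
        System.out.println(pair.length());                             // 2
        // The later code point API sees them as one character, U+10000.
        System.out.println(pair.codePointCount(0, pair.length()));     // 1
        System.out.println(Integer.toHexString(pair.codePointAt(0)));  // 10000
        // The UTF-8 converter emits the four-byte form.
        System.out.println(utf8Hex(pair));                             // f0-90-80-80

        // A \u escape takes exactly four hex digits, so \u10000 is read as
        // U+1000 followed by the digit '0', not as U+10000.
        String fiveDigits = loadValue("key=\\u10000\n");
        System.out.println(Integer.toHexString(fiveDigits.charAt(0))); // 1000
        System.out.println(fiveDigits.charAt(1));                      // 0
    }
}
```

So on such a JDK the answer to the original question is "\ud800\udc00": the pair is stored as two chars, and it is the later code point methods and the UTF-8 converter that treat it as U+10000.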

