RE: surrogate at java's property file

Carl W. Brown Thu, 04 Oct 2001 18:04:39 -0700

Addison,

It might be easier to convert the JVM from UCS-2 to UTF-32 so that you do
not have to worry about surrogates.  This would more closely match most Unix
implementations (except Sun) where Java is widely used.


Carl


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Addison Phillips [wM]
> Sent: Wednesday, October 03, 2001 4:31 PM
> To: Yung-Fong Tang
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: surrogate at java's property file
>
>
> No fair! You forgot to quote my disclaimer in the next email for my big
> boo-boo regarding what an int is in Java. An int is fine, darnit! It's char
> that was originally (at least externally) limited to 16 bits. Of course,
> many APIs use ints, which don't present a problem. But java.lang.Character
> and java.lang.String would have to change internal representation or add
> methods or something to allow surrogate pairs to be evaluated.
>
> Addison
>
> -----Original Message-----
> From: Yung-Fong Tang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 03, 2001 4:17 PM
> To: Addison Phillips [wM]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: surrogate at java's property file
>
>
> Brian Beck:
> What do you think?
>
> "Addison Phillips [wM]" wrote:
>
> > Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
> > stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
> > versions have defined characters in the supplemental planes.
> >
> > In Java, a java.lang.Character object is closely tied to the definition of
> > an "int", the 16-bit numeric type. Many classes and objects make no
> > distinction (or worse, conflate a character with an int---many methods are
> > defined to take and return ints for "Characters"). As a result, the Java
> > character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
> > surrogate character *is* recognized to be a surrogate, but a high-low pair
> > is not recognized as representing a character, nor can you retrieve the
> > character properties of the matched pair.
> >
> > So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> > represented by the sequence "\ud800\udc00". This sequence does NOT represent
> > U+10000. It represents TWO Characters, which happen to be surrogates that
> > form a valid pair. I should point out that Java is slightly clever. For
> > example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
> > value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
> > vice versa, of course).
> >
> > However, it is unclear how Unicode 3.1 support is going to make it into JDK
> > 1.4++. The APIs are going to have to change to support the supplemental
> > planes and the ripple effects on various APIs seems like an interesting
> > problem. Perhaps they'll redefine an int to be a 32-bit value and switch
> > Java to UTF-32 (yeah, sure.....)
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Globalization Architect / Manager, Globalization Engineering
> > webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
> > +1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
> > -------------------------------------------------
> > Internationalization is an architecture. It is not a feature.
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> > Behalf Of Yung-Fong Tang
> > Sent: Monday, October 01, 2001 5:10 PM
> > To: [EMAIL PROTECTED]
> > Subject: surrogate at java's property file
> >
> > Does anyone know how Java handles surrogate pairs in property files?
> >
> > Java's property files use the \u encoding for non-ASCII characters, so
> > U+00A5 is written as \u00A5. How does this encoding handle a surrogate
> > pair?
> >
> > Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or as "\ud800\udc00"? (I
> > think it should be \u10000.) Or can property files not handle such
> > characters at all?
>
>
>
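
The property-file behaviour described in the quoted messages can be checked
with a short Java sketch. It assumes a current JDK (Properties.load(Reader)
and StandardCharsets are newer than the JDKs discussed in this thread). The
class name SurrogateProps, the key "greeting", and the helper methods are
made up for illustration. It loads the escaped pair from property-file text,
counts the resulting chars, encodes them as UTF-8, and computes the scalar
value of the pair by hand.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class SurrogateProps {
    // Parse property-file text and return the value stored under one key.
    static String load(String text, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Combine a high/low surrogate pair into the scalar value it stands for.
    static int toScalar(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Format bytes as lowercase hex separated by hyphens.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append('-');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes literal in the Java source,
        // so the property-file parser is what decodes them.
        String value = load("greeting=\\ud800\\udc00\n", "greeting");
        System.out.println(value.length());                              // 2
        System.out.println(hex(value.getBytes(StandardCharsets.UTF_8))); // f0-90-80-80
        System.out.println(Integer.toHexString(
                toScalar(value.charAt(0), value.charAt(1))));            // 10000
    }
}
```

The loaded value is two chars long, which matches the point that the escape
pair yields two surrogate code units rather than a single character, while
the UTF-8 encoder still emits the four-byte form for U+10000.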
