Re: AL32UTF8

2004-04-30 Thread Lincoln A. Baxter
On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> Am I right in thinking that perl's internal utf8 representation
> represents surrogates as a single (4 byte) code point and not as
> two separate code points?
>
> This is the form that Oracle call AL32UTF8.
>
> What would be the effect of setting Sv

Re: AL32UTF8

2004-04-30 Thread Larry Wall
On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
: Tim Bunce wrote:
:
: > Am I right in thinking that perl's internal utf8 representation
: > represents surrogates as a single (4 byte) code point and not as
: > two separate code points?
:
: Mmmh. Right and wrong... as a single
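To make the question concrete, here is a minimal sketch in Python (not Perl, and not taken from the thread) of how one supplementary character looks as a single code point, as standard UTF-8, and as a UTF-16 surrogate pair. The character U+1D11E and all variable names are arbitrary choices for illustration.

```python
# One supplementary (non-BMP) character, chosen arbitrarily for illustration.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

# Standard UTF-8 encodes the code point directly as one 4-byte sequence.
utf8 = ch.encode("utf-8")

# UTF-16 splits the same code point into two 16-bit surrogate code units.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(f"code point : U+{ord(ch):04X}")
print(f"UTF-8      : {utf8.hex(' ')} ({len(utf8)} bytes)")
print(f"UTF-16     : {high:04X} {low:04X} (surrogate pair)")
```

The 4-byte form is what the question above describes as "a single (4 byte) code point"; the surrogate pair only exists at the UTF-16 level.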

Re: AL32UTF8

2004-04-30 Thread Jarkko Hietaniemi
>
> Okay. Thanks.
>
> Basically I need to document that Oracle "AL32UTF8" should be used
> as the client charset in preference to the older "UTF8" because
> "UTF8" doesn't do the "best"? thing with surrogate pairs.

"because what Oracle calls UTF8 is not conformant with the modern definition of U

Re: AL32UTF8

2004-04-30 Thread Tim Bunce
[The background to this is that Lincoln and I have been working on Unicode support for DBD::Oracle. (Actually Lincoln's done most of the heavy lifting, I've mostly been setting goals and directions at the DBI API level and scratching at edge cases. Like this one.)]

On Thu, Apr 29, 2004 at 09:23:45

Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
> >
> > This is

Re: AL32UTF8

2004-04-30 Thread Martin Hosken
Dear Tim,

"CESU-8 defines an encoding scheme for Unicode identical to UTF-8 except for its representation of supplementary characters. In CESU-8, supplementary characters are represented as six-byte sequences resulting from the transformation of each UTF-16 surrogate code unit into an eight-bit fo
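The six-byte form described in that quotation can be sketched as follows. This is a minimal illustration in Python that follows only the definition quoted above; `cesu8_encode` is a hypothetical helper name, not a function from Perl, DBD::Oracle, or Oracle, and the sample character U+1D11E is an arbitrary choice.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only).

    BMP characters are encoded exactly as in UTF-8. Each supplementary
    character is first split into its UTF-16 surrogate pair, and each
    surrogate code unit is then written as a 3-byte sequence, giving
    six bytes in total.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # Same bytes as standard UTF-8 for BMP characters.
            out += ch.encode("utf-8")
            continue
        v = cp - 0x10000
        high = 0xD800 | (v >> 10)    # leading surrogate
        low = 0xDC00 | (v & 0x3FF)   # trailing surrogate
        for unit in (high, low):
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)


sample = "\U0001D11E"
print("UTF-8 :", sample.encode("utf-8").hex(" "))
print("CESU-8:", cesu8_encode(sample).hex(" "))
```

For text containing only BMP characters the two encodings produce identical bytes, so the difference only shows up once supplementary characters appear in the data.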

Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Fri, Apr 30, 2004 at 03:49:13PM +0300, Jarkko Hietaniemi wrote:
> >
> > Okay. Thanks.
> >
> > Basically I need to document that Oracle "AL32UTF8" should be used
> > as the client charset in preference to the older "UTF8" because
> > "UTF8" doesn't do the "best"? thing with surrogate pairs.
>

Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> Dear Tim,
>
> >"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
> >except for its representation of supplementary characters. In CESU-8,
> >supplementary characters are represented as six-byte sequences
> >resulting