On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> Am I right in thinking that perl's internal utf8 representation
> represents surrogates as a single (4 byte) code point and not as
> two separate code points?
>
> This is the form that Oracle call AL32UTF8.
>
> What would be the effect of setting Sv
On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
: Tim Bunce wrote:
:
: > Am I right in thinking that perl's internal utf8 representation
: > represents surrogates as a single (4 byte) code point and not as
: > two separate code points?
:
: Mmmh. Right and wrong... as a single
>
> Okay. Thanks.
>
> Basically I need to document that Oracle "AL32UTF8" should be used
> as the client charset in preference to the older "UTF8" because
> "UTF8" doesn't do the "best"? thing with surrogate pairs.
"because what Oracle calls UTF8 is not conformant with the modern
definition of U
[The background to this is that Lincoln and I have been working on
Unicode support for DBD::Oracle. (Actually Lincoln's done most of
the heavy lifting, I've mostly been setting goals and directions
at the DBI API level and scratching at edge cases. Like this one.)]
On Thu, Apr 29, 2004 at 09:23:45
On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
> >
> > This is
Dear Tim,
"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit fo
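
[Illustration, not part of the original messages: a minimal Python sketch of the difference the quoted CESU-8 definition describes. It assumes the standard UTF-16 surrogate arithmetic and uses U+10400 as an arbitrary supplementary character. The helper names to_surrogates and encode_cesu8 are hypothetical, not part of Perl, DBD::Oracle or any Oracle API. Treating Oracle's older "UTF8" charset as behaving this way follows the discussion in this thread and is not verified here.]

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def encode_cesu8(text):
    """Encode text CESU-8 style: each surrogate becomes its own 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters: encode high and low surrogate separately.
            for unit in to_surrogates(cp):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"             # one supplementary code point
utf8 = ch.encode("utf-8")     # single 4-byte sequence
cesu8 = encode_cesu8(ch)      # two 3-byte sequences, 6 bytes total

print(utf8.hex(" "))          # f0 90 90 80
print(cesu8.hex(" "))         # ed a0 81 ed b0 80
```

The 4-byte form is what a conformant UTF-8 encoder produces for one code point. The 6-byte form is what results when each UTF-16 surrogate code unit is encoded on its own, and a strict UTF-8 decoder will reject it.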
On Fri, Apr 30, 2004 at 03:49:13PM +0300, Jarkko Hietaniemi wrote:
> >
> > Okay. Thanks.
> >
> > Basically I need to document that Oracle "AL32UTF8" should be used
> > as the client charset in preference to the older "UTF8" because
> > "UTF8" doesn't do the "best"? thing with surrogate pairs.
>
On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> Dear Tim,
>
> >"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
> >except for its representation of supplementary characters. In CESU-8,
> >supplementary characters are represented as six-byte sequences
> >resulting