On Mon, Apr 25, 2016 at 1:08 AM, Dominique Devienne <ddevienne at gmail.com>
wrote:

> On Mon, Apr 25, 2016 at 3:31 AM, Simon Slavin <slavins at bigfraud.org>
> wrote:
>
> > > These are different concerns, and they don't really pose any
> > > difficulty.  Given an encoding, a column of N characters can take up to
> > > x * N bytes.  Back in the day, "x" was 1.  Now it's something else.  No
> > > big deal.
> >
> > No.  Unicode uses different numbers of bytes to store different
> > characters.  You cannot tell from the number of bytes in a string how
> many
> > characters it encodes, and the programming required to work out the
> string
> > length is complicated.  The combination of six bytes
> >
>
> Don't confuse Unicode encodings with UTF-8, Simon. What you say is true of
> UTF-8, and of UTF-16. But there's also UTF-32, where you *can* tell. --DD
>
> PS: Well, kinda, since you can still have several code points that combine
> to make a single accented character,
>   despite most accented characters having their own code point.
>   Granted, UTF-32 is almost never used for storage and only sometimes used
>   in memory, for efficient processing in some algorithms, but still.
>   Unicode code points != variable-length encoded sequences.
>
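
For what it's worth, here's DD's fixed-width point as a quick sketch
(Python 3; the sample string is only an illustration):

# In UTF-32 every code point occupies exactly 4 bytes, so the byte count
# divided by 4 gives the code point count. UTF-8 and UTF-16 are variable
# width, so no such fixed ratio exists.
s = "naïve ☃"                            # 7 code points

print(len(s))                            # 7 code points
print(len(s.encode("utf-8")))            # 10 bytes (1- to 3-byte sequences)
print(len(s.encode("utf-16-le")))        # 14 bytes (2 each here, 4 for some)
print(len(s.encode("utf-32-le")) // 4)   # 7: bytes / 4 gives the code point count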

Even with UTF-32 there is no one-to-one correspondence between "characters"
and "code points": in UTF-32, one character can still be built from multiple
code points. Unicode processing is far more complex, in any UTF, than with
simple single-byte character sets like ASCII.
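
A minimal illustration of that (again Python 3, purely as a sketch): "é" can
be one precomposed code point (U+00E9) or "e" plus the combining acute accent
(U+0301). Both display as a single character, yet they differ in code point
count, so even fixed-width UTF-32 cannot equate code points with characters.

import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point
combined    = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(len(precomposed))                        # 1 code point
print(len(combined))                           # 2 code points, one visible character
print(len(combined.encode("utf-32-le")) // 4)  # 2: UTF-32 counts code points, not characters
print(unicodedata.normalize("NFC", combined) == precomposed)  # True after normalization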
-- 
Scott Robison
