On Mon, Apr 25, 2016 at 3:31 AM, Simon Slavin <slavins at bigfraud.org> wrote:
> > These are different concerns, and they don't really pose any
> > difficulty. Given an encoding, a column of N characters can take up to
> > x * N bytes. Back in the day, "x" was 1. Now it's something else. No
> > big deal.
>
> No. Unicode uses different numbers of bytes to store different
> characters. You cannot tell from the number of bytes in a string how many
> characters it encodes, and the programming required to work out the string
> length is complicated. The combination of six bytes

Don't confuse Unicode encodings in general with UTF-8, Simon. What you say is
true of UTF-8, and of UTF-16. But there's also UTF-32, where you *can* tell. --DD

PS: Well, kinda, since you can still have several code points that combine
into a single accented character, even though most accented characters have
their own code point. Granted, UTF-32 is almost never used for storage and
only sometimes used in memory, where it allows efficient processing for some
algorithms, but still: Unicode code points != variable-length encoded sequences.
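
A minimal sketch (Python, not from the original thread) of the distinction
being made here: UTF-8 byte length varies per character, UTF-32 is a fixed
4 bytes per code point, and combining sequences mean even code points are
not the same as user-perceived characters.

    # Byte count vs. code-point count vs. user-perceived characters.
    s = "\u00e9"        # precomposed e-acute: 1 code point
    t = "e\u0301"       # 'e' + combining acute accent: 2 code points

    print(len(s.encode("utf-8")))      # 2 bytes: UTF-8 length != code points
    print(len(s.encode("utf-32-le")))  # 4 bytes: UTF-32 is 4 bytes per code point
    print(len(s), len(t))              # 1 2: yet both render as a single accented e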