On Fri, 24 Jun 2011 at 11:24:50 +0200, Remko Tronçon wrote:
> > So I'd say that we should refer to characters in a string, and deal with
> > Unicode code-points in the abstract.
>
> I'm wondering whether 'code points' are any better than UTF-8 based
> positioning. Isn't it possible that a codepoint position also points
> inside a character/glyph/...?

A codepoint is the fundamental thing defined by Unicode, but there is a
related concept which could be called a character (or grapheme?),
consisting of one or more codepoints (a codepoint representing a
non-combining character, followed by zero or more codepoints representing
combining characters).

(A glyph is something different, and as far as I can tell is only
interesting if you make fonts or font-rendering algorithms.)

In UTF-8 a codepoint is one to four bytes, in UTF-16 a codepoint is either
one or two 16-bit words, and in UCS-4 a codepoint is one 32-bit word.

Here are some codepoints:

* U+0041 LATIN CAPITAL LETTER A
* U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
* U+0301 COMBINING ACUTE ACCENT

The grapheme Á could be written either as U+0041 U+0301 (decomposed form)
or as U+00C1 (composed form). Not all graphemes have a composed form.

> For example, in Qt, this would most likely be
> implemented using a QTextCursor (
> http://doc.trolltech.com/4.7/qtextcursor.html ). However, the text
> talks about 'positioning at character X', and it doesn't seem to be
> defined what this means.

That might either be counting graphemes or codepoints, depending...

S
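
To make the codepoint/grapheme distinction above concrete, here is a rough
Python sketch. The helper names (sizes, count_graphemes_naive) are made up
for illustration, and the U+1D11E example is my addition, chosen only
because it needs two 16-bit words in UTF-16. The grapheme counter just
follows the informal definition above (a non-combining codepoint followed
by combining ones); it is not the full Unicode segmentation algorithm, so
treat it as an approximation.

```python
import unicodedata

decomposed = "\u0041\u0301"  # U+0041 followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00C1"          # U+00C1 LATIN CAPITAL LETTER A WITH ACUTE


def sizes(s):
    """Return (codepoints, UTF-8 bytes, UTF-16 words, UCS-4 words) for s."""
    return (
        len(s),                           # Python 3 str length counts codepoints
        len(s.encode("utf-8")),           # bytes
        len(s.encode("utf-16-le")) // 2,  # 16-bit words, no byte-order mark
        len(s.encode("utf-32-le")) // 4,  # 32-bit words, no byte-order mark
    )


def count_graphemes_naive(s):
    """Hypothetical helper: count codepoints that are not combining marks.

    Approximates "graphemes" as described above; real grapheme
    segmentation has more rules than this.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


print(sizes(decomposed))           # (2, 3, 2, 2)
print(sizes(composed))             # (1, 2, 1, 1)
print(sizes("\U0001D11E"))         # (1, 4, 2, 1): one codepoint, two UTF-16 words

# Both spellings normalise to the same sequence of codepoints.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Both spellings are one grapheme under the naive definition.
print(count_graphemes_naive(decomposed), count_graphemes_naive(composed))  # 1 1
```

So "position 1" in the decomposed string lands between the A and its
accent if you count codepoints, which is exactly the situation Remko is
asking about.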
