2016-09-15 21:56 GMT+02:00 Janusz S. Bień <[email protected]>:

> On Thu, Sep 15 2016 at 21:27 CEST, [email protected] writes:
>
> [...]
>
> > Isn't "grapheme cluster" the definition you are looking for?
>
> I don't think so.
>
> However:
>
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>

Your definition of textels is also language dependent, as you are reading it from a Polish point of view. However, you are confusing "graphemes" with "grapheme clusters" here. Your (Polish) textels are in fact the same as the (Polish) grapheme clusters.

Unicode also defines "default grapheme clusters", which are "grapheme clusters" not tailored for a particular language. A "default grapheme cluster" is the minimum unbreakable unit that can be seen as a valid "grapheme cluster" in most languages (or at least in most languages using the same base script, if the script is used in that language; in other scripts, it just provides a minimum compatibility level to allow insertion of foreign text in a multilingual document).

The grapheme clusters can then be used to parse text and apply various processes such as:

- normalization: grapheme clusters are not broken by it and can be compared for canonical equivalence (but you can compare smaller units using only the combining class property, by breaking text on characters with CC=0 and handling the special algorithmic case of modern Hangul syllables; see the Unicode standard about normalization)
- BiDi layout
- line breaking
- word breaking
- most standard text transforms (such as case folding)
- transliteration

Rendering text, however, often requires larger units, as successive grapheme clusters (if not split by a line break or by BiDi reordering) will interact visually to create more complex layouts (notably in Indic scripts), glued together by some controls (notably joining controls); they are also made more complex in some cases where combining classes alone cannot properly represent these interactions.

Additionally, in a few cases the visual order is used for encoding text instead of the standard model using the logical order: this was done to preserve roundtrip compatibility between Unicode and widely used legacy encodings (notably for the Thai script).
However, this has a known caveat (which already existed before Unicode) for some algorithms such as word breaking (implementations need a lookup dictionary, but in Thai this dictionary is not very large) and line breaking (if we don't want to break words or break in the middle of syllables). The default grapheme clusters, however, will correctly break the text to allow Thai text (encoded in visual order) to be rendered correctly.

In summary, the concept of "grapheme clusters" must be read and understood in the Unicode standard only as Unicode terminology used to describe the other algorithms specified in the standard. They are not bound to a particular language unless that language is explicitly specified along with the term; in that case we are no longer handling the "default grapheme clusters" rules but the additional rules that tailor the basic rules used to define the default grapheme clusters.

The "extended grapheme clusters" are used in contexts requiring more complex algorithms that need to group several grapheme clusters in an ordered sequence. These algorithms require some text buffering, and parsing from a random position in text may require looking backward over larger lengths to determine the context. Parsing text sequentially also requires keeping some additional context variables. Plain-text searches based on "extended grapheme clusters" are also much more challenging than searches based on "default grapheme clusters".

For these reasons, the "extended grapheme clusters" are not defined in "default grapheme clusters" but will be needed to match user expectations in particular languages or scripts. You normally don't need any "extended grapheme clusters" in Polish, except in multilingual documents that embed some non-Latin scripts or some technical notations.

> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level then a grapheme cluster.
>
> Moreover I don't want to call <U+006E,U+0301> (LATIN SMALL LETTER N,
> COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
> reasons:
>
> 1. there is nothing extended in it
>

This <U+006E,U+0301> combination is first a "grapheme cluster", before also being an "extended grapheme cluster" in Unicode terminology. The term "extended" comes from an extension added not for the case of combining characters encoded after base characters (or combined with them in a canonically equivalent string), but for other extensions, notably for complex syllabic constructs. Every "grapheme cluster" may also be an "extended grapheme cluster", but the reverse is NOT true. You have to read the standard about the various kinds of text-breaking processes.

> 2. U+0301 is not a grapheme according to Polish linguistics terminology
>

Polish linguistics uses its own Polish term, not "grapheme". In the standard, "grapheme" is what is defined there, in English, only to serve as the base of other definitions needed for parsing text in various languages. In Unicode, U+0301 would be a grapheme, but if used in isolation it would not form a complete grapheme cluster, only a defective one, as it lacks the base with which it should be associated and which should be encoded before it (that base cannot be a non-character or a control, even if these are blockers against reordering for normalization processes and canonical equivalences, and it cannot be another combining character).
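
To make the "ń" example concrete, here is a minimal Python sketch using only the standard unicodedata module. It is an illustration under stated assumptions, not the segmentation algorithm of the standard: the helper names canonically_equivalent and split_on_starters are hypothetical, and the splitter only applies the simplified "break on CC=0" idea mentioned earlier, without Hangul handling, joining controls or any language tailoring.

```python
import unicodedata

PRECOMPOSED = "\u0144"   # LATIN SMALL LETTER N WITH ACUTE
DECOMPOSED = "n\u0301"   # LATIN SMALL LETTER N + COMBINING ACUTE ACCENT


def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def split_on_starters(text):
    """Hypothetical helper: start a new unit at each character with CC=0.

    A character with a non-zero combining class is attached to the unit
    before it. This is only an approximation of grapheme clusters: it
    ignores Hangul jamo, joining controls and tailorings.
    """
    units = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch      # combining mark: glue to the previous unit
        else:
            units.append(ch)     # starter, or a mark with nothing before it
    return units


if __name__ == "__main__":
    # Two different code point sequences, one canonical equivalence class.
    print(PRECOMPOSED == DECOMPOSED)
    print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))
    # Both spellings of the surname give the same number of units.
    print(split_on_starters("Bie" + PRECOMPOSED))
    print(split_on_starters("Bie" + DECOMPOSED))
    # A lone U+0301 has no base: it ends up as a unit on its own.
    print(split_on_starters("\u0301"))
```

With this sketch both spellings of "Bień" split into four units, although one string has four code points and the other five. A lone U+0301 comes out as a unit by itself, which corresponds to the defective case described above.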

