Re: "textels"
On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:

> Janusz Bień wrote:
>
>> For me it means that Swift's characters are equivalence classes of the
>> set of extended grapheme clusters by canonical equivalence relation.
>
> I still hope we can come to some conclusion on the correct Unicode name
> for this concept. I don't think non-Unicode interpretations of terms
> like "grapheme" are grounds for throwing out "grapheme cluster,"

I agree.

> but I can see that the equivalence class itself is lacking a name.

I'm glad.

> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
> are identical entities, only that the language compares them as equal.

I'm fully aware of this.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "textels"
Janusz Bień wrote: > For me it means that Swift's characters are equivalence classes of the > set of extended grapheme clusters by canonical equivalence relation. I still hope we can come to some conclusion on the correct Unicode name for this concept. I don't think non-Unicode interpretations of terms like "grapheme" are grounds for throwing out "grapheme cluster," but I can see that the equivalence class itself is lacking a name. Note that the Swift definition doesn't say that <00E9> and <0065 0301> are identical entities, only that the language compares them as equal. -- Doug Ewell | Thornton, CO, US | ewellic.org
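The distinction drawn above (different entities that nevertheless compare as equal) can be illustrated outside Swift as well. What follows is only a minimal sketch in Python, using the standard unicodedata module; the helper name canonically_equivalent is made up for this illustration, and comparing NFD forms is one common way to test canonical equivalence, not a description of how Swift implements it.

```python
import unicodedata

precomposed = "\u00e9"   # <00E9> LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # <0065 0301> LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# As sequences of code points the two strings are different entities.
same_code_points = precomposed == decomposed


def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings up to canonical equivalence.

    Both strings are brought to NFD before comparison, so precomposed and
    decomposed spellings of the same text compare as equal.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


equal_up_to_equivalence = canonically_equivalent(precomposed, decomposed)
```

Here same_code_points is False while equal_up_to_equivalence is True, which is the "compares them as equal without calling them identical" behaviour in miniature.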
Re: "textels"
Janusz S. Bien: > > From the Unicode glossary: > >> Grapheme. (1) A minimally distinctive unit of writing in the context of a >> particular writing system.[...] (2) What a user thinks of as a character. > >> User-Perceived Character. What everyone thinks of as a character in their >> script. > > […] the definitions are language/locale dependent. A writing system is (usually) language-dependent, a script is not, although some scripts have been used exclusively (or prominently) in a single writing system with a single language. So definition (1) of ‘grapheme’ would be appropriate for linguistics, (2) maybe for typography and computer science, but it’s extremely vague.
Re: "textels"
Quote/Cytat - Christoph Päper (Fri, 16 Sep 2016, 23:51:38): Janusz S. Bień: 1. Graphemes, if I understand correctly, are language dependent, … That’s true in linguistic terminology – well, at least within the more popular schools of thought –, but not in technical (i.e. Unicode) jargon. From the Unicode glossary: Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character. As for (2), cf. User-Perceived Character. What everyone thinks of as a character in their script. So we have "a user" versus "everyone...in their script" - is the difference intentional? Probably not. Anyway the definitions are language/locale dependent. Regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "textels"
Janusz S. Bień: > > 1. Graphemes, if I understand correctly, are language dependent, … That’s true in linguistic terminology – well, at least within the more popular schools of thought –, but not in technical (i.e. Unicode) jargon.
Re: "textels"
Quote/Cytat - Eric Muller (Fri, 16 Sep 2016, 17:47:27): On 9/16/2016 8:30 AM, Janusz S. Bien wrote: Quote/Cytat - Eric Muller (Fri, 16 Sep 2016, 17:03:54): On 9/16/2016 6:52 AM, Janusz S. Bień wrote: (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient). I'm very interested to know more about those cases. For our search engine we were unable to use compatibility equivalence "out of the box" for splitting the ligature because it also converted long s to short s while we wanted to preserve the distinction. I am interested in the problems with *canonical* equivalence. I thought that you were talking about those before. I apologize for the confusion, that was my fault. I tend to answer too quickly and not precisely enough :-( On the other hand I'm not sure canonical equivalence is always what I want and expect, but I don't have specific examples at hand. Regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "textels"
On 9/16/2016 8:30 AM, Janusz S. Bien wrote: Quote/Cytat - Eric Muller(pią, 16 wrz 2016, 17:03:54): On 9/16/2016 6:52 AM, Janusz S. Bień wrote: (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient). I'm very interested to know more about those cases. For our search engine we were unable to use compatibility equivalence "out of the box" for splitting the ligature because it also converted long s to short s while we wanted to preserve the distinction. I am interested in the problems with *canonical* equivalence. I thought that you were talking about those before. Compatibility equivalence is a completely different beast. It is, IMHO, too coarse a tool and best forgotten. For any particular task, it's typically doing too much (e.g. long/short s folding in your case) and too little (not everything you need). There was an attempt at improving the situation, by providing a whole bunch of fine grained, targeted transformations (http://www.unicode.org/reports/tr30/), but that did not pan out. Eric. Thanks, Eric.
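The "too much" part of the problem described above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function name split_ligatures_keep_long_s is invented here, the rule "character name contains LIGATURE and has a <compat> decomposition" is a heuristic chosen for the example, and nothing in it reflects the search engine actually used for the Polish corpus.

```python
import unicodedata

LONG_S_T = "\ufb05"   # LATIN SMALL LIGATURE LONG S T
LONG_S = "\u017f"     # LATIN SMALL LETTER LONG S

# Full compatibility decomposition is recursive: it splits the ligature
# and then also folds the long s into an ordinary s.
nfkd_result = unicodedata.normalize("NFKD", LONG_S_T)


def split_ligatures_keep_long_s(text):
    """Hypothetical targeted transform: split ligatures one level only.

    Only characters whose name contains LIGATURE and whose decomposition
    is tagged <compat> are replaced, and the replacement is not
    decomposed any further, so a long s survives.
    """
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. '<compat> 017F 0074'
        if decomp.startswith("<compat>") and "LIGATURE" in unicodedata.name(ch, ""):
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)
    return "".join(out)


targeted_result = split_ligatures_keep_long_s(LONG_S_T)
```

With these definitions nfkd_result is "st" (long s lost), while targeted_result is long s followed by "t". A bare long s passed to the function is returned unchanged, since its name does not contain LIGATURE.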
Re: "textels"
Quote/Cytat - Eric Muller (Fri, 16 Sep 2016, 17:03:54): On 9/16/2016 6:52 AM, Janusz S. Bień wrote: (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient). I'm very interested to know more about those cases. For our search engine we were unable to use compatibility equivalence "out of the box" for splitting the ligature because it also converted long s to short s while we wanted to preserve the distinction. Regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "textels"
>(I also don't quite understand the semantics of a base character followed by >tag characters, to say the truth.) Page 2 of the following document is where the idea was introduced. http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf The document is linked from the following page. http://www.unicode.org/L2/L2015/Register-2015.html William Overington 16 September 2016
Re: "textels"
jsb...@mimuw.edu.pl wrote: > On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes: [...] >> Isn't "grapheme cluster" the definition you are looking for? > I don't think so. Would an example of a textel that is definitely not a grapheme cluster be a character expressed as a BASE CHARACTER character followed by one or more TAG CHARACTER characters? Such a construct was first suggested for some flag characters. William Overington 16 September 2016
Re: "textels"
On 9/16/2016 6:52 AM, Janusz S. Bień wrote: (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient). I'm very interested to know more about those cases. Thanks, Eric.
Re: "textels"
On Thu, Sep 15 2016 at 21:56 CEST, jsb...@mimuw.edu.pl writes:

[...]

> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>
> 2. Textel "ń" means both U+0144 and , so it is a notion
> on a higher abstraction level then a grapheme cluster.

In other words, textels are equivalence classes of some set of Unicode character strings under an equivalence relation which is at the moment open to discussion but is very close to the official Unicode canonical equivalence (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient).

[...]

On Thu, Sep 15 2016 at 21:27 CEST, leobo...@namakajiri.net writes:

> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"?

As for the Swift "character", perhaps someone fluent in Swift will answer the question?

> (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Thank you very much for the link. Let me quote the relevant fragment:

--8<---cut here---start->8---
Extended Grapheme Clusters

Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.

Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301).
The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it is rendered by a Unicode-aware text-rendering system.

In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:

[...]

*Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent.*
--8<---cut here---end--->8---

For me it means that Swift's characters are equivalence classes of the set of extended grapheme clusters by canonical equivalence relation.

> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

I don't see there a notion of such equivalence classes.

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a “character” contains however many code points it needs
> (“e” with a stacked macron, acute accent, and dieresis is
> algorithmically one “character” in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

If you insist that Swift's "characters" are just grapheme clusters, then you add a different, although related, meaning to the term "grapheme cluster". I think the notion deserves a term of its own.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
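The "equivalence class" reading argued for above can be made concrete with a small Python sketch. Everything named here is an assumption made for illustration: the class name Textel simply borrows the term proposed in this thread, the NFC form is picked arbitrarily as the representative of each class, and the sketch does not check that its argument really is a single extended grapheme cluster (the Python standard library has no UAX #29 segmenter).

```python
import unicodedata


class Textel:
    """Hypothetical type: one class of canonically equivalent strings.

    Two instances are equal exactly when their texts are canonically
    equivalent; the NFC form serves as the class representative.
    """

    def __init__(self, text):
        self.representative = unicodedata.normalize("NFC", text)

    def __eq__(self, other):
        return isinstance(other, Textel) and self.representative == other.representative

    def __hash__(self):
        # Hash the representative so equal textels land in the same bucket.
        return hash(self.representative)

    def __repr__(self):
        return "Textel(%r)" % self.representative


precomposed = Textel("\u0144")   # U+0144 LATIN SMALL LETTER N WITH ACUTE
decomposed = Textel("n\u0301")   # LATIN SMALL LETTER N + COMBINING ACUTE ACCENT

# Both spellings collapse into a single element of the set.
distinct_textels = {precomposed, decomposed}
```

In this sketch precomposed == decomposed holds and distinct_textels has one element, whereas the two underlying strings remain different code point sequences. That is the sense in which such an object sits one level of abstraction above a grapheme cluster.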
Re: "textels"
> Date: Fri, 16 Sep 2016 10:25:53 +0100 (BST) > From: William_J_G Overington> > jsb...@mimuw.edu.pl wrote: > > > On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes: > > [...] > > >> Isn't "grapheme cluster" the definition you are looking for? > > > I don't think so. > > Is an example of a textel that would definitely not be a grapheme cluster be > when a character is expressed as a BASE CHARACTER character followed by one > or more TAG CHARACTER characters. Since no formal definition of a "textel" was presented, except via an example, it's not clear to me whether what you propose can be a textel. (I also don't quite understand the semantics of a base character followed by tag characters, to say the truth.)
Re: "textels"
2016-09-15 21:56 GMT+02:00 Janusz S. Bień:

> On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes:
>
> [...]
>
> > Isn't "grapheme cluster" the definition you are looking for?
>
> I don't think so.
>
> However:
>
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.

Your definition of textels is also language dependent, as you are reading it from a Polish point of view. However, you are confusing "graphemes" with "grapheme clusters" here. Your (Polish) textels are in fact the same as the (Polish) grapheme clusters.

Unicode also defines "default grapheme clusters", which are "grapheme clusters" not tailored for a particular language. A "default grapheme cluster" is the minimum unbreakable unit that can be seen as a valid "grapheme cluster" in most languages (or at least in most languages using the same base script if the script is used in that language; in other scripts, it just provides a minimum compatibility level to allow insertion of foreign texts in a multilingual document).

The grapheme clusters can then be used to parse text and apply various processes such as

- normalization: grapheme clusters are not broken by it and can be compared for canonical equivalence (but you can compare smaller units using only the combining class property by breaking text on characters with CC=0 and handling the special algorithmic case of modern Hangul syllables; see the Unicode standard about normalization)
- BiDi layout
- line breaking
- word breaking
- most standard text transforms (such as case folding)
- transliteration

Rendering text, however, often requires larger units, as successive grapheme clusters (if not split by a line break or by BiDi reordering) will interact visually to create more complex layouts (notably in Indic scripts), glued together by some controls (notably joining controls); they are also made more complex in some cases where combining classes alone cannot properly represent these interactions.
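The remark above about "breaking text on characters with CC=0" can be sketched in a few lines of Python. This is a deliberately crude illustration, not an implementation of UAX #29 grapheme cluster segmentation: the function name split_on_starters is invented for the example, and the special handling of modern Hangul syllables mentioned above is left out entirely.

```python
import unicodedata


def split_on_starters(text):
    """Hypothetical helper: start a new unit at every character whose
    canonical combining class is 0; characters with a non-zero class are
    attached to the unit that precedes them.
    """
    units = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not units:
            units.append(ch)      # a starter (or the very first character)
        else:
            units[-1] += ch       # a combining mark joins the previous unit
    return units


# "n" + COMBINING ACUTE ACCENT, followed by a plain "a": two units.
polish_example = split_on_starters("n\u0301a")

# "e" with a stacked macron, acute accent and diaeresis: a single unit.
stacked_example = split_on_starters("e\u0304\u0301\u0308")
```

Each resulting unit can then be normalized and compared on its own. Real grapheme cluster boundaries depend on more than the combining class (for example joining controls and Hangul), so the units produced here should not be mistaken for grapheme clusters in general.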
Additionally, in a few cases the visual order is used for encoding text instead of the standard model using the logical order: this was done to preserve round-trip compatibility between Unicode and widely used legacy encodings (notably for the Thai script). However, this has a known caveat (which already existed before Unicode) for some algorithms such as word breaking (implementations need a lookup dictionary, but in Thai this dictionary is not very large) and line breaking (if we don't want to break words or break in the middle of syllables). The default grapheme clusters, however, will correctly break the text to allow Thai text (encoded in visual order) to be rendered correctly.

In summary, the concept of "grapheme clusters" must be read and understood in the Unicode standard only as Unicode terminology used to describe all the other algorithms described in the standard. They are not bound to a particular language unless this language is explicitly specified with the term; in that case we won't be handling the "default grapheme clusters" rules but the additional rules tailoring the basic rules used to define the default grapheme clusters.

The "extended grapheme clusters" are used in contexts requiring more complex algorithms that need to group several grapheme clusters in an ordered sequence. These algorithms require some text buffering, and parsing from a random position in text may require looking backward over larger lengths to determine the context. Parsing text sequentially also requires keeping some additional context variables. Plain text searches based on "extended grapheme clusters" are also much more challenging than searches on "default grapheme clusters". For these reasons, the "extended grapheme clusters" are not defined in "default grapheme clusters" but will be needed for matching user expectations in particular languages or scripts.
You normally don't need any "extended grapheme clusters" in Polish, except in multilingual documents that embed some non-Latin scripts, or some technical notations.

> 2. Textel "ń" means both U+0144 and , so it is a notion
> on a higher abstraction level then a grapheme cluster.
>
> Moreover I don't want to call (LATIN SMALL LETTER N,
> COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
> reasons:
>
> 1. there is nothing extended in it

This combination is first a "grapheme cluster", before being also an "extended grapheme cluster" in Unicode terminology. The term "extended" comes from an extension added not for the case of combining characters encoded after base characters (or combined with them in a canonically equivalent string), but for other extensions, notably for complex syllabic constructs: every "grapheme cluster" may also be an "extended grapheme cluster", but the reverse is NOT true. You have to read the standard about the various kinds of text breaking processes.

> 2. U+0301 is not a
Re: "textels"
On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes:

[...]

> Isn't "grapheme cluster" the definition you are looking for?

I don't think so.

On Thu, Sep 15 2016 at 21:27 CEST, leobo...@namakajiri.net writes:

> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"? (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>
> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

Perhaps I don't properly understand the rather obscure definitions, like

  An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters.

However:

1. Graphemes, if I understand correctly, are language dependent, textels are not.

2. Textel "ń" means both U+0144 and , so it is a notion on a higher abstraction level than a grapheme cluster.

Moreover I don't want to call (LATIN SMALL LETTER N, COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2 reasons:

1. there is nothing extended in it

2. U+0301 is not a grapheme according to Polish linguistics terminology

Regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "textels" (was: Default character encoding for each operating system?)
> From: jsb...@mimuw.edu.pl (Janusz S. Bień) > Date: Thu, 15 Sep 2016 21:12:53 +0200 > Cc: mufi-fonts> > On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes: > > [...] > > > In the new Swift programming language, which is white-hot in the Apple > > community, Apple is moving toward a model of a transparent, generic > > Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary, > > but in which a “character” contains however many code points it needs > > (“e” with a stacked macron, acute accent, and dieresis is > > algorithmically one “character” in Swift). Moreover, > > e-with-an-acute-accent and e followed by a combining acute accent, for > > example, compare as equal. At present, the underlying code is still > > UTF-16LE. > > For several years I use the name "textel" (text element, in Polish > "tekstel") for such objects. I do it mostly orally in my presentations > for my students, but I used it also in writing e.g. in > http://bc.klf.uw.edu.pl/118/, unfortunately without a proper > definition. Isn't "grapheme cluster" the definition you are looking for?