Re: "textels"

2016-09-20 Thread Janusz S. Bień
On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
> Janusz Bień wrote:
>
>> For me it means that Swift's characters are equivalence classes of the
>> set of extended grapheme clusters by canonical equivalence relation.
>
> I still hope we can come to some conclusion on the correct Unicode name
> for this concept. I don't think non-Unicode interpretations of terms
> like "grapheme" are grounds for throwing out "grapheme cluster,"

I agree.

> but I can see that the equivalence class itself is lacking a name.

I'm glad.

>
> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
> are identical entities, only that the language compares them as equal.

I'm fully aware of this.

Best regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels"

2016-09-20 Thread Doug Ewell
Janusz Bień wrote:

> For me it means that Swift's characters are equivalence classes of the
> set of extended grapheme clusters by canonical equivalence relation.

I still hope we can come to some conclusion on the correct Unicode name
for this concept. I don't think non-Unicode interpretations of terms
like "grapheme" are grounds for throwing out "grapheme cluster," but I
can see that the equivalence class itself is lacking a name.

Note that the Swift definition doesn't say that <00E9> and <0065 0301>
are identical entities, only that the language compares them as equal.

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: "textels"

2016-09-18 Thread Christoph Päper
Janusz S. Bien :
> 
> From the Unicode glossary:
> 
>> Grapheme. (1) A minimally distinctive unit of writing in the context of a 
>> particular writing system.[...] (2) What a user thinks of as a character.
> 
>> User-Perceived Character. What everyone thinks of as a character in their 
>> script.
> 
> […] the definitions are language/locale dependent.

A writing system is (usually) language-dependent, a script is not, although 
some scripts have been used exclusively (or prominently) in a single writing 
system with a single language. So definition (1) of ‘grapheme’ would be 
appropriate for linguistics, (2) maybe for typography and computer science, but 
it’s extremely vague.


Re: "textels"

2016-09-18 Thread Janusz S. Bien
Quote - Christoph Päper (Fri, 16 Sep 2016, 23:51:38):



Janusz S. Bień :


1. Graphemes, if I understand correctly, are language dependent, …


That’s true in linguistic terminology – well, at least within the  
more popular schools of thought –, but not in technical (i.e.  
Unicode) jargon.


From the Unicode glossary:

Grapheme. (1) A minimally distinctive unit of writing in the context  
of a particular writing system.[...] (2) What a user thinks of as a  
character.


As for (2), cf.

User-Perceived Character. What everyone thinks of as a character in  
their script.


So we have "a user" versus "everyone...in their script" - is the  
difference intentional? Probably not. Anyway the definitions are  
language/locale dependent.


Regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels"

2016-09-16 Thread Christoph Päper
Janusz S. Bień :
> 
> 1. Graphemes, if I understand correctly, are language dependent, …

That’s true in linguistic terminology – well, at least within the more popular 
schools of thought –, but not in technical (i.e. Unicode) jargon.


Re: "textels"

2016-09-16 Thread Janusz S. Bien
Quote - Eric Muller (Fri, 16 Sep 2016, 17:47:27):



On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
Quote - Eric Muller (Fri, 16 Sep 2016, 17:03:54):



On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

(when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).


I'm very interested to know more about those cases.


For our search engine we were unable to use compatibility  
equivalence "out of the box" for splitting the ligature because it  
also converted long s to short s while we wanted to preserve the  
distinction.


I am interested in the problems with *canonical* equivalence. I  
thought that you were talking about those before.


I apologize for the confusion, that was my fault. I tend to answer too  
quickly and not precisely enough :-(


On the other hand I'm not sure canonical equivalence is always what I  
want and expect, but I don't have specific examples at hand.


Regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels"

2016-09-16 Thread Eric Muller

On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
Quote - Eric Muller (Fri, 16 Sep 2016, 17:03:54):



On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

(when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).


I'm very interested to know more about those cases.


For our search engine we were unable to use compatibility equivalence 
"out of the box" for splitting the ligature because it also converted 
long s to short s while we wanted to preserve the distinction.


I am interested in the problems with *canonical* equivalence. I thought 
that you were talking about those before.


Compatibility equivalence is a completely different beast. It is, IMHO, 
too coarse a tool and best forgotten. For any particular task, it's 
typically doing too much (e.g. long/short s folding in your case) and 
too little (not everything you need). There was an attempt at improving 
the situation, by providing a whole bunch of fine grained, targeted 
transformations (http://www.unicode.org/reports/tr30/), but that did not 
pan out.
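
A minimal sketch of such a targeted transformation, in Python. The helper name `split_ligatures` and the rule "expand only one decomposition step, and only for characters whose name contains LIGATURE" are assumptions of this example, not the code of any search engine mentioned in this thread. It contrasts with NFKC, which also folds long s:

```python
import unicodedata

def split_ligatures(text):
    """Expand compatibility ligatures by one step, leaving other characters alone."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)   # e.g. '<compat> 017F 0074'
        is_ligature = "LIGATURE" in unicodedata.name(ch, "")
        if decomp.startswith("<compat>") and is_ligature:
            # Use only the direct mapping; do not decompose the results further.
            out.extend(chr(int(cp, 16)) for cp in decomp.split()[1:])
        else:
            out.append(ch)
    return "".join(out)

ligature = "\ufb05"                       # LATIN SMALL LIGATURE LONG S T
folded = unicodedata.normalize("NFKC", ligature)
preserved = split_ligatures(ligature)

print([hex(ord(c)) for c in folded])      # long s has been folded to s
print([hex(ord(c)) for c in preserved])   # long s (U+017F) is kept
```

Because the expansion stops after one step, U+017F LATIN SMALL LETTER LONG S is never touched, whether it appears on its own or as the result of splitting a ligature.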


Eric.



Thanks,
Eric.



Re: "textels"

2016-09-16 Thread Janusz S. Bien
Quote - Eric Muller (Fri, 16 Sep 2016, 17:03:54):



On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

(when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).


I'm very interested to know more about those cases.


For our search engine we were unable to use compatibility equivalence  
"out of the box" for splitting the ligature because it also converted  
long s to short s while we wanted to preserve the distinction.


Regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels"

2016-09-16 Thread William_J_G Overington
>(I also don't quite understand the semantics of a base character followed by 
>tag characters, to tell the truth.)

Page 2 of the following document is where the idea was introduced.

http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf

The document is linked from the following page.

http://www.unicode.org/L2/L2015/Register-2015.html

William Overington

16 September 2016




Re: "textels"

2016-09-16 Thread William_J_G Overington
jsb...@mimuw.edu.pl wrote:

> On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes:

[...]

>> Isn't "grapheme cluster" the definition you are looking for?

> I don't think so.

Would an example of a textel that is definitely not a grapheme cluster be a 
character expressed as a BASE CHARACTER character followed by one or more 
TAG CHARACTER characters?

Such a construct was first suggested for some flag characters.
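
A minimal sketch of what such a sequence looks like as code points, in Python. It assumes the scheme in which the tag characters U+E0020..U+E007E mirror printable ASCII and U+E007F CANCEL TAG ends the sequence; the function name `tag_sequence` and the sample tag "gbeng" are illustrative choices, and whether the linked proposal uses exactly this form is not confirmed here.

```python
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset
CANCEL_TAG = "\U000E007F"     # terminates the tag sequence

def tag_sequence(base, ascii_tag):
    """Return base + one tag character per ASCII character + CANCEL TAG."""
    if not all(0x20 <= ord(c) <= 0x7E for c in ascii_tag):
        raise ValueError("tag must be printable ASCII")
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in ascii_tag)
    return base + tags + CANCEL_TAG

# WAVING BLACK FLAG followed by the tag "gbeng"
flag = tag_sequence("\U0001F3F4", "gbeng")
print(" ".join(f"U+{ord(c):04X}" for c in flag))
```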

William Overington

16 September 2016




Re: "textels"

2016-09-16 Thread Eric Muller

On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

(when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).


I'm very interested to know more about those cases.

Thanks,
Eric.



Re: "textels"

2016-09-16 Thread Janusz S. Bień
On Thu, Sep 15 2016 at 21:56 CEST, jsb...@mimuw.edu.pl writes:

[...]

> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>
> 2. Textel "ń" means both U+0144 and <U+006E, U+0301>, so it is a notion
> on a higher abstraction level than a grapheme cluster.

In other words, textels are equivalence classes of some set of Unicode
character strings under an equivalence relation which at the moment is
open to discussion but is very close to the official Unicode
canonical equivalence (when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).

[...]


On Thu, Sep 15 2016 at 21:27 CEST, leobo...@namakajiri.net writes:
> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"?

As for the Swift "character", perhaps someone fluent in Swift will answer
the question?

> (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Thank you very much for the link. Let me quote the relevant fragment:

--8<---cut here---start->8---
 
Extended Grapheme Clusters

Every instance of Swift’s Character type represents a single extended
grapheme cluster. An extended grapheme cluster is a sequence of one or
more Unicode scalars that (when combined) produce a single
human-readable character.

Here’s an example. The letter é can be represented as the single Unicode
scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same
letter can also be represented as a pair of scalars—a standard letter e
(LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE
ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically
applied to the scalar that precedes it, turning an e into an é when it
is rendered by a Unicode-aware text-rendering system.

In both cases, the letter é is represented as a single Swift Character
value that represents an extended grapheme cluster. In the first case,
the cluster contains a single scalar; in the second case, it is a
cluster of two scalars:

[...]

*Two String values (or two Character values) are considered equal if
their extended grapheme clusters are canonically equivalent.*

--8<---cut here---end--->8---

For me it means that Swift's characters are equivalence classes of the
set of extended grapheme clusters by canonical equivalence relation.
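
A minimal sketch of that reading, in Python rather than Swift (an assumption of this example; the names `canonical_key` and `equivalence_classes` are hypothetical). Strings are grouped by their NFD form, so each group is one equivalence class under canonical equivalence:

```python
import unicodedata
from collections import defaultdict

def canonical_key(s):
    """Return the NFD form; canonically equivalent strings share it."""
    return unicodedata.normalize("NFD", s)

def equivalence_classes(strings):
    """Group strings into classes of canonically equivalent members."""
    classes = defaultdict(list)
    for s in strings:
        classes[canonical_key(s)].append(s)
    return list(classes.values())

precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The two strings are distinct code point sequences ...
print(precomposed == decomposed)
# ... but they fall into the same class, separate from plain "e".
classes = equivalence_classes([precomposed, decomposed, "e"])
print(len(classes))
```

The class, not any single member of it, is the object that lacks a name in this discussion.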


> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

I don't see a notion of such equivalence classes there.

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a “character” contains however many code points it needs
> (“e” with a stacked macron, acute accent, and dieresis is
> algorithmically one “character” in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

If you insist that Swift's "characters" are just grapheme clusters, then
you add a different, although related, meaning to the term "grapheme
cluster". I think the notion deserves a term of its own.

Best regards

Janusz


-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels"

2016-09-16 Thread Eli Zaretskii
> Date: Fri, 16 Sep 2016 10:25:53 +0100 (BST)
> From: William_J_G Overington 
> 
> jsb...@mimuw.edu.pl wrote:
> 
> > On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes:
> 
> [...]
> 
> >> Isn't "grapheme cluster" the definition you are looking for?
> 
> > I don't think so.
> 
> Would an example of a textel that is definitely not a grapheme cluster be a 
> character expressed as a BASE CHARACTER character followed by one or more 
> TAG CHARACTER characters?

Since no formal definition of a "textel" was presented, except via an
example, it's not clear to me whether what you propose can be a
textel.  (I also don't quite understand the semantics of a base
character followed by tag characters, to tell the truth.)


Re: "textels"

2016-09-15 Thread Philippe Verdy
2016-09-15 21:56 GMT+02:00 Janusz S. Bień :

> On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes:
>
> [...]
>
> > Isn't "grapheme cluster" the definition you are looking for?
>
> I don't think so.
>
> However:
>
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>

Your definition of textels is also language dependent, as you are reading
it from a Polish point of view.
However, you are confusing "graphemes" with "grapheme clusters" here.

Your (Polish) textels are in fact the same as the (Polish) grapheme
clusters.

Unicode also defines "default grapheme clusters", which are "grapheme
clusters" not tailored for a particular language. A "default grapheme
cluster" is the minimum unbreakable unit that can be seen as a valid
"grapheme cluster" in most languages (or at least in most languages using
the same base script if the script is used in that language; in other
scripts, it just provides a minimum compatibility level to allow insertion
of foreign texts in a multilingual document).

The grapheme clusters can then be used to parse text and apply various
processes such as

  - normalization : grapheme clusters are not broken by it and can be
compared for canonical equivalences (but you can compare smaller units
using only the combining class property by breaking text on characters with
CC=0 and handling the special algorithmic case of modern Hangul syllables;
see the Unicode standard about normalization)
  - BiDi layout
  - line breaking
  - word breaking
  - most standard text transforms (such as case folding)
  - transliteration
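
The normalization item above can be sketched in Python as follows. This is only an illustration under stated assumptions: `split_on_starters` and `canonically_equal` are hypothetical helper names, the split uses nothing but the combining class property, and it ignores the Hangul special case, so it is not the UAX #29 grapheme cluster algorithm.

```python
import unicodedata

def split_on_starters(text):
    """Split text into runs that each begin at a character with combining class 0."""
    runs = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not runs:
            runs.append(ch)        # a starter opens a new run
        else:
            runs[-1] += ch         # a combining mark joins the current run
    return runs

def canonically_equal(a, b):
    """Compare two strings for canonical equivalence via their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

runs = split_on_starters("n\u0301a")
print(len(runs))
print(canonically_equal("\u0144", runs[0]))
```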

Rendering text, however, often requires larger units, as successive grapheme
clusters (if not split by a line break or by BiDi reordering) will interact
visually to create more complex layouts (notably in Indic scripts), glued
together by some controls (notably joining controls); they become even more
complex in some cases where combining classes alone cannot properly
represent these interactions.

Additionally, in a few cases, the visual order is used for encoding text
instead of the standard model using the logical order: this was done to
preserve round-trip compatibility between Unicode and widely used legacy
encodings (notably for the Thai script). However, this has a known caveat
(which already existed before Unicode) for some algorithms such as word
breaking (implementations need a lookup dictionary, but in
Thai this dictionary is not very large) and line breaking (if we don't
want to break words or break in the middle of syllables). The default grapheme
clusters, however, will correctly break the text to allow Thai text (encoded
in visual order) to be rendered correctly.

In summary, the concept of "grapheme clusters" must be read and understood
in the Unicode standard only as Unicode terminology used to describe the
other algorithms in the standard. They are not bound to a
particular language unless this language is explicitly specified along with
the term; in that case we are no longer handling the "default grapheme
clusters" rules but the additional rules tailoring the basic rules used to
define the default grapheme clusters.

The "extended grapheme clusters" are used in contexts requiring more complex
algorithms that need to group several grapheme clusters in an ordered
sequence. These algorithms require some text buffering, and parsing from a
random position in text may require looking backward over larger lengths to
determine the context. Parsing text sequentially also requires keeping some
additional context variables. Plain text searches based on "extended
grapheme clusters" are also much more challenging than searches on "default
grapheme clusters".

For these reasons, the "extended grapheme clusters" are not defined in
"default grapheme clusters" but will be needed for matching user
expectations in particular languages or scripts. You normally don't need
any "extended grapheme clusters" in Polish, except in multilingual
documents that are embedding some non-Latin scripts, or some technical
notations.



> 2. Textel "ń" means both U+0144 and <U+006E, U+0301>, so it is a notion
> on a higher abstraction level than a grapheme cluster.
>
> Moreover I don't want to call <U+006E, U+0301> (LATIN SMALL LETTER N,
> COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
> reasons:
>
> 1. there is nothing extended in it
>

This <U+006E, U+0301> combination is first a "grapheme cluster", before
being also an "extended grapheme cluster" in Unicode terminology.

The term "extended" comes from an extension added not for the case of
combining characters encoded after base characters (or combined with them in a
canonically equivalent string), but for other extensions, notably for
complex syllabic constructs:

Every "grapheme cluster" may also be an "extended grapheme cluster", but
the reverse is NOT true.

You have to read the standard about the various kinds of text breaking
processes.


> 2. U+0301 is not a 

Re: "textels"

2016-09-15 Thread Janusz S. Bień
On Thu, Sep 15 2016 at 21:27 CEST, e...@gnu.org writes:

[...]

> Isn't "grapheme cluster" the definition you are looking for?

I don't think so.

On Thu, Sep 15 2016 at 21:27 CEST, leobo...@namakajiri.net writes:
> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"? (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>
> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

Perhaps I don't understand properly the rather obscure definitions, like

An extended grapheme cluster is the same as a legacy grapheme
cluster, with the addition of some other characters.

However:

1. Graphemes, if I understand correctly, are language dependent, textels
are not.

2. Textel "ń" means both U+0144 and <U+006E, U+0301>, so it is a notion
on a higher abstraction level than a grapheme cluster.

Moreover I don't want to call <U+006E, U+0301> (LATIN SMALL LETTER N,
COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
reasons:

1. there is nothing extended in it
2. U+0301 is not a grapheme according to Polish linguistics terminology

Regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels" (was: Default character encoding for each operating system?)

2016-09-15 Thread Eli Zaretskii
> From: jsb...@mimuw.edu.pl (Janusz S. Bień)
> Date: Thu, 15 Sep 2016 21:12:53 +0200
> Cc: mufi-fonts 
> 
> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:
> 
> [...]
> 
> > In the new Swift programming language, which is white-hot in the Apple
> > community, Apple is moving toward a model of a transparent, generic
> > Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
> > but in which a “character” contains however many code points it needs
> > (“e” with a stacked macron, acute accent, and dieresis is
> > algorithmically one “character” in Swift). Moreover,
> > e-with-an-acute-accent and e followed by a combining acute accent, for
> > example, compare as equal. At present, the underlying code is still
> > UTF-16LE.
> 
> > For several years I have used the name "textel" (text element, in Polish
> > "tekstel") for such objects. I do it mostly orally in my presentations
> > for my students, but I have also used it in writing, e.g. in
> > http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
> > definition.

Isn't "grapheme cluster" the definition you are looking for?