Re: graphemes (was: "textels")

2016-09-20 Thread Christoph Päper
Julian Bradfield :
> On 2016-09-19, Christoph Päper  wrote:
>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal spellings of 
>> the same word in a writing system, a useful linguistic definition of 
>> grapheme should ensure that all three variants have the same number of 
>> graphemes.
> 
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes,

It’s not a bizarre definition at all, but one could also assume two or three 
different writing systems.

> would break the first three of your rules of thumb:

It would, at least partially.

> and the fourth is pretty dodgy, as it usually contradicts the others
> 
>> - … whatever can never be split up by hyphenation.

It’s not phrased well, and it does indeed sometimes contradict the other rules 
of thumb, but together they often work reasonably well to separate clear cases 
from questionable ones, which different scholars are likely to treat 
differently.


Re: graphemes (was: "textels")

2016-09-20 Thread Julian Bradfield
On 2016-09-19, Christoph Päper  wrote:
> If
>
> - encyclopedia
> - encyclopædia
> - encyclopaedia
>
> are all legal spellings of the same word in a writing system, a useful 
> linguistic definition of grapheme should ensure that all three variants have 
> the same number of graphemes.

Such a bizarre definition, which would also entail "color/colour",
"fulfill/fulfil", "sulfur/sulphur" having the same number of
graphemes, would break the first three of your rules of thumb:

> - … whatever goes into a single box in a crossword puzzle.
> - … whatever gets transposed if you reverse a word or generate an anagram.
> - … whatever gets capitalized together in the beginning of a word.

and the fourth is pretty dodgy, as it usually contradicts the others

> - … whatever can never be split up by hyphenation.


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: graphemes (was: "textels")

2016-09-20 Thread Christoph Päper
Dalley Mark (South West Commissioning Support) :
> 
> I think the key phrase is "user-perceived". And you don't need to involve 
> complex scripts either.
> 
> For instance as an English-speaking person, I would perceive the "æ" in 
> "encyclopædia" as being two characters (albeit shoved together somewhat). The 
> argument for this is that the word can equally well be rendered as 
> "encyclopaedia".

If

- encyclopedia
- encyclopædia
- encyclopaedia

are all legal spellings of the same word in a writing system, a useful 
linguistic definition of grapheme should ensure that all three variants have 
the same number of graphemes.
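
To make the counting concrete, here is a minimal Python sketch. It assumes a 
hypothetical writing system in which the letter pair "ae" is listed as a single 
unit; the DIGRAPHS table and the count_graphemes function are invented for 
illustration and come from no standard or library.

```python
# Hypothetical table: letter sequences that this (assumed) writing system
# treats as a single grapheme. Not taken from any standard.
DIGRAPHS = ("ae",)


def count_graphemes(word, digraphs=DIGRAPHS):
    """Count units left to right, letting a listed digraph consume one unit."""
    count = 0
    i = 0
    while i < len(word):
        for d in digraphs:
            if word.startswith(d, i):
                i += len(d)  # the whole digraph counts as one unit
                break
        else:
            i += 1  # any other code point counts as one unit
        count += 1
    return count


for spelling in ("encyclopedia", "encyclopædia", "encyclopaedia"):
    # Prints the spelling, its code point count, and its unit count.
    print(spelling, len(spelling), count_graphemes(spelling))
```

The code point counts are 12, 12 and 13, while the table-based counts are 12 
for all three spellings. A real analysis would need more than a string table, 
since the same letter pair need not form one unit in every word.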

Although linguists often prefer minimal pair analysis, there are some rules of 
thumb for what is a grapheme:

- … whatever goes into a single box in a crossword puzzle.
- … whatever gets transposed if you reverse a word or generate an anagram.
- … whatever gets capitalized together in the beginning of a word.
   (Some argue that capitalization operates on characters, not graphemes, 
though.)
- … whatever can never be split up by hyphenation.
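
The reversal rule can be tried out in code. The Python sketch below is a rough 
approximation only: it attaches combining marks to the preceding base character 
and is not the full segmentation algorithm of any standard. The names clusters 
and reverse_by_cluster are made up for this example.

```python
import unicodedata


def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)
    return out


def reverse_by_cluster(text):
    """Reverse a string while keeping each base+marks group intact."""
    return "".join(reversed(clusters(text)))


word = "cafe\u0301"  # "café" written with U+0301 COMBINING ACUTE ACCENT
naive = word[::-1]                  # accent comes first, detached from "e"
grouped = reverse_by_cluster(word)  # accent stays on the "e"
```

Reversing by code point moves the accent away from its letter, while reversing 
by group keeps "é" together, which is what the rule of thumb would predict for 
a single grapheme.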


RE: graphemes (was: "textels")

2016-09-19 Thread Dalley Mark (South West Commissioning Support)
I think the key phrase is "user-perceived". And you don't need to involve 
complex scripts either.

For instance as an English-speaking person, I would perceive the "æ" in 
"encyclopædia" as being two characters (albeit shoved together somewhat). The 
argument for this is that the word can equally well be rendered as 
"encyclopaedia".

A Danish or Norwegian speaker, on the other hand, would perceive "æ" (as in 
"ære" or "æsj!") as being a single indivisible character.
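
On the encoding side, the character data shipped with Python's unicodedata 
module can be queried to see how "æ" is classified. This short check only 
reports what that data says; it does not settle what a given reader perceives. 
The "ﬁ" ligature is included purely as a contrast.

```python
import unicodedata

ae = "\u00e6"  # æ
fi = "\ufb01"  # ﬁ, a compatibility ligature, for contrast

print(unicodedata.name(ae))                     # LATIN SMALL LETTER AE
print(repr(unicodedata.decomposition(ae)))      # '' (no decomposition mapping)
print(unicodedata.normalize("NFKD", ae) == ae)  # True
print(unicodedata.normalize("NFKD", fi))        # fi
```

So normalization never turns "æ" into "ae", whereas the "ﬁ" ligature does 
decompose into two letters. Treating "æ" as two characters is therefore a 
matter of the reader's language, not of the encoding.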

Mark Dalley

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Janusz S. Bien
Sent: 19 September 2016 07:40
To: Christoph Päper
Cc: unicode Unicode Discussion
Subject: graphemes (was: "textels")

On Sun, Sep 18 2016 at 21:40 CEST, christoph.pae...@crissov.de writes:
> Janusz S. Bien :
>> 
>> From the Unicode glossary:
>> 
>>> Grapheme. (1) A minimally distinctive unit of writing in the context of a 
>>> particular writing system.[...] (2) What a user thinks of as a character.
>> 
>>> User-Perceived Character. What everyone thinks of as a character in their 
>>> script.
>> 
>> […] the definitions are language/locale dependent.
>
> A writing system is (usually) language-dependent, a script is not, 
> although some scripts have been used exclusively (or prominently) in a 
> single writing system with a single language.

It depends, of course, on what exactly you mean by script, and on which meaning 
of the term is intended in the definition of User-Perceived Character. But "a 
user" is definitely language/locale dependent :-)

> So definition (1) of ‘grapheme’ would be appropriate for linguistics,
> (2) maybe for typography and computer science, but it’s extremely 
> vague.

I think that 'grapheme' (2) in the present wording is simply incorrect. I 
suspect it is not used in the standard at all.

Searching the Unicode site I found only one use of 'grapheme' alone:

http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm

Graphemes are sequences of one or more encoded characters that
correspond to what users think of as characters.

I guess the intention of 'grapheme' (2) was to describe it without any 
reference to computer encoding, which is definitely an extremely difficult task.

Best regards

Janusz


-- 
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
