Re: graphemes

2016-09-20 Thread Janusz S. Bień
On Tue, Sep 20 2016 at 10:57 CEST, christoph.pae...@crissov.de writes:
> Julian Bradfield :
>> On 2016-09-19, Christoph Päper  wrote:
>>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal
>>> spellings of the same word in a writing system, a useful linguistic
>>> definition of grapheme should ensure that all three variants have
>>> the same number of graphemes.
>> 
>> Such a bizarre definition, which would also entail "color/colour",
>> "fulfill/fulfil", "sulfur/sulphur" having the same number of
>> graphemes,
>
> It’s not a bizarre definition at all, but one could also assume two or three 
> different writing systems.
>
>> would break the first three of your rules of thumb:
>
> It would, at least partially.
>
>> and the fourth is pretty dodgy, as it usually contradicts the others
>> 
>>> - … whatever can never be split up by hyphenation.
>
> It’s not phrased well and it does contradict the other rules of thumb
> sometimes indeed, but together they often work reasonably well to
> separate clear cases from questionable ones which are likely to be
> treated differently by different scholars.

Let me recall the issues that started the thread:


On Sun, Sep 18 2016 at 12:26 CEST, jsb...@mimuw.edu.pl writes:
> Quote - Christoph Päper (Fri, 16 Sep 2016, 23:51:38):
>
>> Janusz S. Bień :
>>>
>>> 1. Graphemes, if I understand correctly, are language dependent, …
>>
>> That’s true in linguistic terminology – well, at least within the
>> more popular schools of thought –, but not in technical (i.e.
>> Unicode) jargon.

And what is "grapheme" in "technical (i.e. Unicode) jargon"?

>
> From the Unicode glossary:
>
> Grapheme. (1) A minimally distinctive unit of writing in the context
> of a particular writing system.[...] (2) What a user thinks of as a
> character.
>
> As for (2), cf.
>
> User-Perceived Character. What everyone thinks of as a character in
> their script.
>
> So we have "a user" versus "everyone...in their script" - is the
> difference intentional? Probably not. Anyway the definitions are
> language/locale dependent.

Does 'Grapheme' (2) make sense with "a (single?) user"? 

BTW, it is rather well known that the term "phoneme" was first proposed
by the Polish linguist Jan Niecisław Ignacy Baudouin de Courtenay (13
March 1845 – 3 November 1929), cf. e.g.
https://en.wikipedia.org/wiki/Jan_Baudouin_de_Courtenay.  It is much
less well known that he also proposed the term "grapheme". Let me quote
Alexander Bergs's "English Historical Linguistics vol. I" page 230 from
Google Books:

   Since the introduction of the term grapheme by Baudouin de
   Courtenay in 1901 (Ruszkiewicz 1976:24-37, 1981 [1978], 20-34),
   it has been defined in various ways:

   [...]

   As can be seen from these quotations, the available definitions
   can be divided into two groups, corresponding to two main senses,
   and reflecting "conflicting linguistic views of the status of
   writing" (Henderson 1985:142):

   1. a letter or cluster of letters referring to or corresponding with a
   single phoneme;

   2. the minimal distinctive unit of a writing system.

For me the first meaning (not mentioned at all in English Wikipedia) is
the primary, i.e. more useful, meaning, as it has some practical
applications, e.g. for describing Polish hyphenation rules.
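
To make sense 1 concrete, here is a minimal Python sketch that splits a
Polish word into letters and letter clusters by greedy matching. The
digraph list and the function name are illustrative assumptions, not a
complete description of Polish orthography; in particular the sketch
ignores words where a pair such as "rz" stands for two phonemes (e.g.
"marznąć"), so it only approximates the phoneme-based definition.

```python
import unicodedata

# Assumed, deliberately incomplete list of Polish digraphs that usually
# correspond to a single phoneme.
DIGRAPHS = ("ch", "cz", "dz", "dź", "dż", "rz", "sz")


def segment(word):
    """Greedily split a word into digraphs and single letters."""
    # NFC keeps letters such as "ą" as one code point, so that each
    # step of the loop consumes exactly one letter or one digraph.
    word = unicodedata.normalize("NFC", word.lower())
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


deszcz = segment("deszcz")
print(deszcz, len(deszcz))  # 6 letters, 4 units
```

Under sense 2 the same word would simply have six units, one per
letter, which is why the two senses give different counts.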

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "textels"

2016-09-20 Thread Janusz S. Bień
On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
> Janusz Bień wrote:
>
>> For me it means that Swift's characters are equivalence classes of the
>> set of extended grapheme clusters by canonical equivalence relation.
>
> I still hope we can come to some conclusion on the correct Unicode name
> for this concept. I don't think non-Unicode interpretations of terms
> like "grapheme" are grounds for throwing out "grapheme cluster,"

I agree.

> but I can see that the equivalence class itself is lacking a name.

I'm glad.

>
> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
> are identical entities, only that the language compares them as equal.

I'm fully aware of this.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: Dataset for all ISO639 code sorted by country/territory?

2016-09-20 Thread Doug Ewell
Mats Blakstad wrote:

> Is there any dataset that contains all languages in the world sorted
> by country/territory?

As others have pointed out, be careful about how slippery this slope can
get. Everyone has his or her own opinion about how many speakers of
Language X in country Y need to be identified, estimated, or conjectured
in order to say that "language X is spoken in country Y."

> I manage to find a dataset on the website of Ethnologue, though it
> doesn't look like open source, need to check with them exactly how I'm
> allowed to use it:
> http://www.ethnologue.com/codes/download-code-tables

The readme file included in the downloadable zip file makes SIL's terms
very clear. Basically you need to credit SIL as the source of the data,
not change it, and not make the data directly available for others to
download. It's best not to get caught up in "open source" as if any
other terms would make the data totally unusable.

--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: "textels"

2016-09-20 Thread Doug Ewell
Janusz Bień wrote:

> For me it means that Swift's characters are equivalence classes of the
> set of extended grapheme clusters by canonical equivalence relation.

I still hope we can come to some conclusion on the correct Unicode name
for this concept. I don't think non-Unicode interpretations of terms
like "grapheme" are grounds for throwing out "grapheme cluster," but I
can see that the equivalence class itself is lacking a name.
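
The unnamed equivalence class can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function
name is made up, each input string is assumed to be a single extended
grapheme cluster already (the Python standard library does not segment
text into clusters), and the NFC form is used as an arbitrary
representative of each class.

```python
import unicodedata
from collections import defaultdict


def canonical_classes(clusters):
    """Group strings that are canonically equivalent to each other."""
    classes = defaultdict(list)
    for cluster in clusters:
        # Canonically equivalent strings share the same NFC form,
        # so that form can serve as the key of the class.
        key = unicodedata.normalize("NFC", cluster)
        classes[key].append(cluster)
    return dict(classes)


# U+00E9, <U+0065 U+0301>, plain "e", U+00C5, U+212B, <U+0041 U+030A>
samples = ["\u00e9", "e\u0301", "e", "\u00c5", "\u212b", "A\u030a"]
classes = canonical_classes(samples)

for key, members in classes.items():
    # Print the code points of every member of the class.
    print([[f"{ord(c):04X}" for c in m] for m in members])
```

The six sample sequences fall into three classes: two spellings of
"é", plain "e" on its own, and three spellings of "Å".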

Note that the Swift definition doesn't say that <00E9> and <0065 0301>
are identical entities, only that the language compares them as equal.
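
The difference between "identical" and "compares as equal" can be shown
with a small Python sketch. Python is used here only for illustration:
unlike Swift, its == on strings compares code points, so the canonical
comparison has to be written out by hand. The helper name is an
assumption of this sketch.

```python
import unicodedata

precomposed = "\u00e9"        # <00E9>
decomposed = "\u0065\u0301"   # <0065 0301>


def canonically_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The underlying code point sequences are different entities ...
print(len(precomposed), len(decomposed))           # 1 2
print(precomposed == decomposed)                   # False
# ... but they are equal under canonical equivalence.
print(canonically_equal(precomposed, decomposed))  # True
```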

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: graphemes

2016-09-20 Thread Ken Whistler

On 9/20/2016 12:30 AM, Julian Bradfield wrote:

>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal
>> spellings of the same word in a writing system, a useful linguistic
>> definition of grapheme should ensure that all three variants have
>> the same number of graphemes.
>
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes, would break the first three of your rules of thumb:



I agree with Julian here. Consider also such common alternations as
night/nite, light/lite, which are widespread *within* American English
spelling conventions and don't even raise questions of locale
differences. Or you/u, your/ur, which vary on another dimension. If
every variation in spelling is taken to constitute a distinct writing
system, simply to preserve the concept of a "grapheme", we would be led
to conclude that American English has millions of writing systems,
because of the combinatorics involved.
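
The arithmetic behind "millions" can be sketched as follows, assuming
(for illustration only) that each alternation is an independent two-way
choice and that every combination of choices counts as its own writing
system. The function name is hypothetical.

```python
def writing_systems(alternations, variants_per_word=2):
    """Count combinations of independent spelling choices."""
    return variants_per_word ** alternations


# The four alternations named above: night/nite, light/lite, you/u, your/ur.
print(writing_systems(4))   # 16
# Twenty such alternations already exceed one million combinations.
print(writing_systems(20))  # 1048576
```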


And the caveat that it is a "legal" spelling is a hinky dodge, 
particularly in the case of English. There isn't any recognized legal 
framework for English spelling. English, she is spelled how people 
decide to spell her -- or perhaps mostly how 2nd grade English teachers 
decide she is spelled.


Even where legal or academic frameworks exist to formally control the 
spelling rules of a language, one should be leery that such rules 
somehow instantiate the identity of graphemes, which are unlikely to be 
the principal matter of concern for those trying to establish the 
spelling rules in the first place.


--Ken



Re: graphemes (was: "textels")

2016-09-20 Thread Christoph Päper
Julian Bradfield :
> On 2016-09-19, Christoph Päper  wrote:
>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal spellings of 
>> the same word in a writing system, a useful linguistic definition of 
>> grapheme should ensure that all three variants have the same number of 
>> graphemes.
> 
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes,

It’s not a bizarre definition at all, but one could also assume two or three 
different writing systems.

> would break the first three of your rules of thumb:

It would, at least partially.

> and the fourth is pretty dodgy, as it usually contradicts the others
> 
>> - … whatever can never be split up by hyphenation.

It’s not phrased well and it does contradict the other rules of thumb sometimes 
indeed, but together they often work reasonably well to separate clear cases 
from questionable ones which are likely to be treated differently by different 
scholars.
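
For comparison, here is what a purely code-point-based count gives for
the three spellings. This is a minimal Python sketch; it counts code
points after normalization rather than graphemes in any linguistic
sense, and the variable names are chosen for this example only.

```python
import unicodedata

spellings = ["encyclopedia", "encyclop\u00e6dia", "encyclopaedia"]

counts = []
for word in spellings:
    # U+00E6 (æ) has no canonical or compatibility decomposition,
    # so none of the normalization forms turns it into "ae" or "e".
    nfc = unicodedata.normalize("NFC", word)
    nfkd = unicodedata.normalize("NFKD", word)
    counts.append((len(nfc), len(nfkd)))

print(counts)  # [(12, 12), (12, 12), (13, 13)]
```

So the counts are 12, 12 and 13: no Unicode normalization form makes
the three variants the same length, and a definition that does so has
to come from outside the encoding.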