Re: graphemes
On Tue, Sep 20 2016 at 10:57 CEST, christoph.pae...@crissov.de writes:
> Julian Bradfield:
>> On 2016-09-19, Christoph Päper wrote:
>>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal
>>> spellings of the same word in a writing system, a useful linguistic
>>> definition of grapheme should ensure that all three variants have
>>> the same number of graphemes.
>>
>> Such a bizarre definition, which would also entail "color/colour",
>> "fulfill/fulfil", "sulfur/sulphur" having the same number of
>> graphemes,
>
> It’s not a bizarre definition at all, but one could also assume two or three
> different writing systems.
>
>> would break the first three of your rules of thumb:
>
> It would, at least partially.
>
>> and the fourth is pretty dodgy, as it usually contradicts the others
>>
>>> - … whatever can never be split up by hyphenation.
>
> It’s not phrased well and it does contradict the other rules of thumb
> sometimes indeed, but together they often work reasonably well to
> separate clear cases from questionable ones which are likely to be
> treated differently by different scholars.

Let me recall the issues that started the thread:

On Sun, Sep 18 2016 at 12:26 CEST, jsb...@mimuw.edu.pl writes:
> Quote - Christoph Päper (Fri, 16 Sep 2016, 23:51:38):
>
>> Janusz S. Bień :
>>>
>>> 1. Graphemes, if I understand correctly, are language dependent, …
>>
>> That’s true in linguistic terminology – well, at least within the
>> more popular schools of thought –, but not in technical (i.e.
>> Unicode) jargon.

And what is "grapheme" in "technical (i.e. Unicode) jargon"?
>
> From the Unicode glossary:
>
> Grapheme. (1) A minimally distinctive unit of writing in the context
> of a particular writing system.[...] (2) What a user thinks of as a
> character.
>
> As for (2), cf.
>
> User-Perceived Character. What everyone thinks of as a character in
> their script.
>
> So we have "a user" versus "everyone...in their script" - is the
> difference intentional? Probably not. Anyway the definitions are
> language/locale dependent.

Does 'Grapheme' (2) make sense with "a (single?) user"?

BTW, it is rather well known that the term "phoneme" was first proposed
by a Polish linguist, Jan Niecisław Ignacy Baudouin de Courtenay
(13 March 1845 – 3 November 1929), cf. e.g.
https://en.wikipedia.org/wiki/Jan_Baudouin_de_Courtenay. It is much
less well known that he also proposed the term "grapheme". Let me quote
Alexander Berg's "English Historical Linguistics vol. I", page 230, from
Google Books:

  Since the introduction of the term grapheme by Baudouin de Courtenay
  in 1901 (Ruszkiewicz 1976:24-37, 1981 [1978], 20-34), it has been
  defined in various ways: [...] As can be seen from these quotations,
  the available definitions can be divided into two groups,
  corresponding to two main senses, and reflecting "conflicting
  linguistics views of the status of writing" (Henderson 1985:142):

  1. a letter or cluster of letters referring to or corresponding with
     a single phoneme;
  2. the minimal distinctive unit of a writing system.

For me the first meaning (not mentioned at all in English Wikipedia) is
the primary, i.e. more useful, meaning, as it has some practical
applications, e.g. for describing Polish hyphenation rules.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "textels"
On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
> Janusz Bień wrote:
>
>> For me it means that Swift's characters are equivalence classes of the
>> set of extended grapheme clusters by canonical equivalence relation.
>
> I still hope we can come to some conclusion on the correct Unicode name
> for this concept. I don't think non-Unicode interpretations of terms
> like "grapheme" are grounds for throwing out "grapheme cluster,"

I agree.

> but I can see that the equivalence class itself is lacking a name.

I'm glad.

>
> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
> are identical entities, only that the language compares them as equal.

I'm fully aware of this.

Best regards

Janusz
Re: Dataset for all ISO639 code sorted by country/territory?
Mats Blakstad wrote:

> Is there any dataset that contains all languages in the world sorted
> by country/territory?

As others have pointed out, be careful about how slippery this slope can
get. Everyone has his or her own opinion about how many speakers of
Language X in country Y need to be identified, estimated, or conjectured
in order to say that "language X is spoken in country Y."

> I manage to find a dataset on the website of Ethnologue, though it
> doesn't look like open source, need to check with them exactly how I'm
> allowed to use it:
> http://www.ethnologue.com/codes/download-code-tables

The readme file included in the downloadable zip file makes SIL's terms
very clear. Basically you need to credit SIL as the source of the data,
not change it, and not make the data directly available for others to
download.

It's best not to get caught up in "open source" as if any other terms
would make the data totally unusable.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: "textels"
Janusz Bień wrote:

> For me it means that Swift's characters are equivalence classes of the
> set of extended grapheme clusters by canonical equivalence relation.

I still hope we can come to some conclusion on the correct Unicode name
for this concept. I don't think non-Unicode interpretations of terms
like "grapheme" are grounds for throwing out "grapheme cluster," but I
can see that the equivalence class itself is lacking a name.

Note that the Swift definition doesn't say that <00E9> and <0065 0301>
are identical entities, only that the language compares them as equal.

--
Doug Ewell | Thornton, CO, US | ewellic.org
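
The difference between "identical entities" and "compare as equal" can be sketched outside Swift as well. The snippet below is only an illustration in Python, not Swift and not anything posted in the thread; it assumes the standard `unicodedata` module, and the helper name `canonically_equivalent` is made up for this example. Unlike Swift, Python's `==` compares code point sequences, so the equivalence has to be requested explicitly through normalization.

```python
import unicodedata

precomposed = "\u00E9"        # <00E9> LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"   # <0065 0301> e + COMBINING ACUTE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings up to canonical equivalence.

    Both strings are brought to NFD first, so canonically equivalent
    sequences end up as the same code point sequence.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two strings are different code point sequences ...
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2

# ... but they fall into the same canonical equivalence class.
print(canonically_equivalent(precomposed, decomposed))  # True
```

In these terms, a "character" in the sense discussed above would correspond to the whole class of sequences for which the helper returns True, rather than to any one member of it.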
Re: graphemes
On 9/20/2016 12:30 AM, Julian Bradfield wrote:

>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal
>> spellings of the same word in a writing system, a useful linguistic
>> definition of grapheme should ensure that all three variants have
>> the same number of graphemes.
>
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes, would break the first three of your rules of thumb:

I agree with Julian here. Consider also similar common alternations
such as night/nite, light/lite, which are widespread *within* American
English spelling conventions and don't even raise questions of locale
differences. Or you/u, your/ur, which vary on another dimension.

If every variation in spelling is taken to constitute a distinct
writing system, simply to preserve the concept of a "grapheme", we
would be led to conclude that American English has millions of writing
systems, because of the combinatorics involved.

And the caveat that it is a "legal" spelling is a hinky dodge,
particularly in the case of English. There isn't any recognized legal
framework for English spelling. English, she is spelled how people
decide to spell her -- or perhaps mostly how 2nd grade English teachers
decide she is spelled.

Even where legal or academic frameworks exist to formally control the
spelling rules of a language, one should be leery of assuming that such
rules somehow instantiate the identity of graphemes, which are unlikely
to be the principal matter of concern for those trying to establish the
spelling rules in the first place.

--Ken
Re: graphemes (was: "textels")
Julian Bradfield:
> On 2016-09-19, Christoph Päper wrote:
>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal spellings of
>> the same word in a writing system, a useful linguistic definition of
>> grapheme should ensure that all three variants have the same number of
>> graphemes.
>
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes,

It’s not a bizarre definition at all, but one could also assume two or three
different writing systems.

> would break the first three of your rules of thumb:

It would, at least partially.

> and the fourth is pretty dodgy, as it usually contradicts the others
>
>> - … whatever can never be split up by hyphenation.

It’s not phrased well and it does contradict the other rules of thumb
sometimes indeed, but together they often work reasonably well to
separate clear cases from questionable ones which are likely to be
treated differently by different scholars.
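
For reference, the three spellings do not come out equal under a purely technical count. The following is a rough Python sketch, added as an illustration and not taken from any message in the thread; it counts Unicode code points, which is not the linguistic notion of grapheme under discussion, and the variable names are invented for the example. It also checks that compatibility normalization does not turn "æ" into "ae", so Unicode normalization alone does not unify the variants.

```python
import unicodedata

# The three spellings from the thread; U+00E6 is the precomposed letter ae.
variants = ["encyclopedia", "encyclop\u00E6dia", "encyclopaedia"]

# Count code points per spelling (a technical count, not a linguistic one).
counts = [len(word) for word in variants]
print(counts)  # [12, 12, 13]

# U+00E6 has no decomposition mapping, so NFKD leaves it untouched
# and the second and third spellings stay distinct after normalization.
ae_unchanged = unicodedata.normalize("NFKD", "\u00E6") == "\u00E6"
print(ae_unchanged)  # True
```

Any definition under which all three variants have the same number of graphemes therefore has to come from orthographic analysis of the writing system, not from counting encoded characters.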