Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - William_J_G Overington(Mon 13 Mar 2017 12:24:13 PM CET): Prof. Janusz S. Bień wrote: Just yet another reason for introducing the notion of textel? I opine that it would be a good idea to introduce several new words, of which textel would be one, with each such new word having a precisely-defined meaning so that in precise discussions of programming techniques people could discuss the situation without needing to use any of the words character, code point, grapheme cluster. How many such new words would be needed? In my paper (in Polish) http://bc.klf.uw.edu.pl/480/ I propose also the term "texton" meaning a code point from a specific subset, not yet fully defined, but including at least the components of composite characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - Richard Wordingham(Sun 12 Mar 2017 09:10:22 PM CET): On Sun, 12 Mar 2017 20:02:28 +0100 "Janusz S. Bien" wrote: If the basic notion has to be referred in a cumbersome way as "extended grapheme cluster" then it is easier to talk about "Unicode characters" despite the fact that they have a rather loose relation to real-life/user-perceived characters. The notion that extended grapheme clusters correspond to user-perceived characters is also rather dodgy. The idea is not mine, but it appears from time to time on the list in a more or less explicit way. Whereas it may work for French, it is getting very dubious by the time one adds Hebrew cantillation marks or Vedic accentuation. The Thais revolted when their preposed vowels were joined with the following consonant in the same extended grapheme cluster, and Unicode had to revoke that union. Just yet another reason for introducing the notion of textel? Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
Prof. Janusz S. Bień wrote:

> Just yet another reason for introducing the notion of textel?

I opine that it would be a good idea to introduce several new words, of which textel would be one, with each such new word having a precisely-defined meaning so that in precise discussions of programming techniques people could discuss the situation without needing to use any of the words character, code point, grapheme cluster.

How many such new words would be needed?

I remember how in electronics the introduction of the term Hertz to be used instead of cycles per second helped discussions. After the introduction of the term Hertz it became easy to refer to twenty cycles of a fifty Hertz signal without confusion over one's meaning.

So introducing several new precisely-defined words now could help lots of discussions in the future.

Perhaps, apart from textel, the definitions could be produced first and then people can decide, for each such definition, which new word would be a good word to have that definition.

The recent introduction into Unicode of ZWJ sequences for some emoji and the introduction into Unicode of tag sequences applied to a base character could mean that the introducing of such new words becomes of increasing importance due to the programming implications of those recently introduced techniques.

William Overington

Monday 13 March 2017
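A minimal Python sketch of the programming implications mentioned above, added for illustration and not taken from the thread. It counts one emoji ZWJ sequence and one emoji tag sequence in three different units. The helper names are hypothetical, and the two sequences are chosen only as examples.

```python
# One displayed emoji can span several code points; each unit gives a different "length".

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# WOMAN + ZWJ + PERSONAL COMPUTER (an emoji ZWJ sequence)
zwj_sequence = "\U0001F469" + ZWJ + "\U0001F4BB"

# WAVING BLACK FLAG + tag characters spelling "gbsct" + CANCEL TAG (an emoji tag sequence)
tag_sequence = (
    "\U0001F3F4"
    + "".join(chr(0xE0000 + ord(c)) for c in "gbsct")
    + "\U000E007F"
)


def code_points(s):
    """Number of Unicode code points (what len() reports for a Python str)."""
    return len(s)


def utf16_units(s):
    """Number of 16-bit code units in the UTF-16 encoding."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s):
    """Number of bytes in the UTF-8 encoding."""
    return len(s.encode("utf-8"))


for label, s in (("ZWJ sequence", zwj_sequence), ("tag sequence", tag_sequence)):
    print(label, code_points(s), utf16_units(s), utf8_bytes(s))
```

None of the three counts is 1, although a renderer that supports these sequences shows a single image for each. That gap is what the proposed vocabulary would have to name.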
Re: "A Programmer's Introduction to Unicode"
On 3/13/2017 3:31 AM, Janusz S. Bien wrote:

> Just yet another reason for introducing the notion of textel?

The main difference between "textel" and "pixel" is that the unit of processing/displaying text is not uniform and fixed, unlike a pixel.

In other words, different operations may need to look at text differently, and I don't mean the trivial case of storage (byte level) vs. any higher level.

Correspondingly the discussion of "text element", at least in the early versions of the Unicode Standard, left the particular division of the text into "text elements" unspecified.

There are closely related tasks that might demonstrate this. Assume a script where multiple code points make up a syllable, yet that syllable is the intuitive basic unit of reading and writing.

One task is cursor placement. For that task, you need to be able to divide *any* text so that the cursor ideally does not get positioned in the middle of a syllable. However, the definition of a "syllable" has to allow degenerate and 'defective' cases. Which is which is of no importance, as long as it is possible to find a valid cursor position.

The other task would be to assert that a string contains only well-formed syllables. Here, it is crucially necessary to be able to define which syllables are well-formed. Finding divisions in parts of the string that do not contain well-formed syllables is not necessary.

You may also find that in some cases, even though the syllable is the basic unit, there may be a need to edit it in ways other than as a unit. Some syllables may have some optional marks, signs or symbols added that may need to be edited or traversed explicitly, while a "core" syllable may be more likely to be a unit.

These (or similar) scenarios indicate the impossibility to come to a single, universal definition of a "textel" -- the main reason why this term is of lower utility than "pixel".

A./
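The two tasks described above can be sketched in Python. This is an illustration added here, not something from the thread, and everything in it is an assumption: the "script" is a made-up ASCII toy in which a well-formed syllable is one consonant (k, g, t), one vowel (a, i, u) and an optional mark (~). The function names are hypothetical.

```python
import re

# Task 1: cursor placement. Every input must be divisible, so the pattern
# accepts defective pieces (a bare consonant, a stray vowel, stray marks)
# and falls back to a single arbitrary character.
_SEGMENT = re.compile(r"[kgt][aiu]?~*|[aiu]~*|~+|.", re.DOTALL)

# Task 2: validation. Only complete syllables are accepted; nothing is said
# about where to divide text that fails the check.
_WELL_FORMED = re.compile(r"(?:[kgt][aiu]~?)+")


def cursor_segments(text):
    """Split any text into units between which a cursor may be placed."""
    return _SEGMENT.findall(text)


def is_well_formed(text):
    """Return True only if the text consists entirely of well-formed syllables."""
    return _WELL_FORMED.fullmatch(text) is not None


sample = "kat~i~~x"
print(cursor_segments(sample))   # segmentation always succeeds
print(is_well_formed(sample))    # validation may fail
print(is_well_formed("kagu~ti"))
```

The two patterns disagree on purpose. The segmenter has to be total and tolerant, the validator has to be strict, and neither definition of "syllable" can stand in for the other.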
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - Asmus Freytag(Mon 13 Mar 2017 06:00:08 PM CET): [...] This (or similar) scenarios indicate the impossibility to come to a single, universal definition of a "textel" -- the main reason why this term is of lower utility than "pixel". I agree that it is impossible to come to a single, universal definition of text elements, but it seems possible to reach a consensus on a kind of the least common denominator of them and call it "textel" or something else. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
Do you have examples of AA being split that way (and further reading)? I think I'm aware of what you're talking about, but would love to read more about it.

-Manish

On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham wrote:
> On Mon, 13 Mar 2017 23:10:11 +0200
> Khaled Hosny wrote:
>
>> But there are many text operations that require access to Unicode code
>> points. Take for example text layout, as mapping characters to glyphs
>> and back has to operate on code points. The idea that you never need
>> to work with code points is too simplistic.
>
> There are advantages to interpreting and operating on text as though it
> were in form NFD. However, there are still cases where one needs
> fractions of a character, such as word boundaries in Sanskrit, though I
> think the locations are liable to be specified in a language-specific
> form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
> in at least 4 ways.
>
> Richard.
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokar wrote:

> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.

Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf.

There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'.

Richard.
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - J Decker(Mon 13 Mar 2017 06:55:18 PM CET): texel looks to be defined as a graphic element already. TEXture ELement. I'm aware of it, but homonymy/polysemy is something we have to live with. I think there is no risk of confusing texture elements with text elements, despite the fact that 'texture' and 'text' have similar origin. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
I liked the Go implementation of the character type - a rune type - which is a codepoint, and strings that return runes by index. https://blog.golang.org/strings

Doesn't solve the problem for composited codepoints though...

texel looks to be defined as a graphic element already. TEXture ELement.

On Mon, Mar 13, 2017 at 10:15 AM, Janusz S. Bien wrote:
> Quote/Cytat - Asmus Freytag (Mon 13 Mar 2017 06:00:08 PM CET):
>
> [...]
>
> This (or similar) scenarios indicate the impossibility to come to a
> single, universal definition of a "textel" -- the main reason why this
> term is of lower utility than "pixel".
>
> I agree that it is impossible to come to a single, universal definition
> of text elements, but it seems possible to reach a consensus on a kind of
> the least common denominator of them and call it "textel" or something else.
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
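A short Python sketch of the "composited codepoints" caveat, added for illustration and not taken from the thread. Python's str is indexed by code point, so it stands in here for the rune view. (In Go itself, indexing a string yields bytes, and a range loop over the string yields runes.) The variable names are hypothetical.

```python
import unicodedata

precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT, two code points
ki = "\u0915\u093F"       # DEVANAGARI LETTER KA + VOWEL SIGN I, two code points

# Indexing by code point cuts the accent off its base letter.
print(len(precomposed), len(decomposed), repr(decomposed[0]))

# Composing (NFC) helps for the Latin case only because a precomposed
# character happens to exist.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# There is no precomposed character for KA + VOWEL SIGN I, so the unit a
# reader sees still takes two code points after NFC.
print(len(unicodedata.normalize("NFC", ki)))
```

Access by code point is therefore a step up from bytes or UTF-16 code units, but it does not deliver user-perceived characters.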
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 20:20:25 -0400 "Mark E. Shoulson" wrote:

> Sanskrit external vowel sandhi is comparatively
> straightforward (compared to consonant sandhi), and it frequently
> loses information. A *or* AA plus I is E; A *or* AA plus U is O (you
> need A + O to get AU).

Indeed, E can not only be A or AA plus I or II: it can also be E + A. In the latter case avagraha is usual, at least in European practice. (Would that generally be locale sa_Deva_GB?) I'd like advice on modern Indian practice, and on the spacing and syllable division. I've seen a claim that avagraha always belongs with the preceding vowel, but I'm not sure that that rule applies in this case.

In a similar fashion, O can be -AS + A-, an interesting case of visarga sandhi.

However, I'm not sure that one would want to *divide* the E or O.

Richard.
Re: "A Programmer's Introduction to Unicode"
A word ending in A *or* AA preceding a word beginning in A *or* AA will all coalesce to a single AA in Sanskrit. That's four possibilities, and that doesn't count a word ending in a consonant preceding a word beginning in AA, which would be written the same.

My memory is rusty, so I should actually be looking things up, but I think these are valid constructions:

न + अगच्छत् → नागच्छत्
न + आगच्छत् → नागच्छत्

(and indeed, आगच्छत् is the upasarga आ plus अगच्छत्, so there too the A + AA coalesced.)

I should probably find you examples for all the other possibilities.

Sanskrit external vowel sandhi is comparatively straightforward (compared to consonant sandhi), and it frequently loses information. A *or* AA plus I is E; A *or* AA plus U is O (you need A + O to get AU).

~mark

On 03/13/2017 06:26 PM, Manish Goregaokar wrote:

Do you have examples of AA being split that way (and further reading)? I think I'm aware of what you're talking about, but would love to read more about it.

-Manish

On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham wrote:

On Mon, 13 Mar 2017 23:10:11 +0200 Khaled Hosny wrote:

But there are many text operations that require access to Unicode code points. Take for example text layout, as mapping characters to glyphs and back has to operate on code points. The idea that you never need to work with code points is too simplistic.

There are advantages to interpreting and operating on text as though it were in form NFD. However, there are still cases where one needs fractions of a character, such as word boundaries in Sanskrit, though I think the locations are liable to be specified in a language-specific form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it in at least 4 ways.

Richard.
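The two constructions given above can be checked at the code point level. The following Python sketch was added for illustration and is not taken from the thread. The function `join_a_sandhi` is a hypothetical toy covering only the A/AA case described in this message; it is not a general sandhi implementation.

```python
LETTER_A = "\u0905"   # DEVANAGARI LETTER A
LETTER_AA = "\u0906"  # DEVANAGARI LETTER AA
SIGN_AA = "\u093E"    # DEVANAGARI VOWEL SIGN AA


def join_a_sandhi(first, second):
    """Toy rule: final a/aa + initial a/aa is written as a single vowel sign AA."""
    if second[:1] not in (LETTER_A, LETTER_AA):
        raise ValueError("second word must begin with A or AA")
    if first.endswith(SIGN_AA):
        stem = first[:-1]                      # final long aa
    elif first and "\u0915" <= first[-1] <= "\u0939":
        stem = first                           # bare consonant letter: inherent short a
    else:
        raise ValueError("first word must end in a or aa")
    return stem + SIGN_AA + second[1:]


na = "\u0928"                                # न
rest = "\u0917\u091A\u094D\u091B\u0924\u094D"  # गच्छत्
agacchat = LETTER_A + rest                   # अगच्छत्
aagacchat = LETTER_AA + rest                 # आगच्छत्

joined_short = join_a_sandhi(na, agacchat)
joined_long = join_a_sandhi(na, aagacchat)
print(joined_short == joined_long)           # both pairs give the same written form
print([f"U+{ord(c):04X}" for c in joined_short])
```

Both inputs produce the identical string नागच्छत्, in which a single U+093E carries material from two words. Recovering the word boundary therefore needs linguistic knowledge that the code points alone do not hold.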
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 19:18:00 + Alastair Houghton wrote:

> IMO, returning code points by index is a mistake. It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

The problem is that UTF-16 based code can very easily overlook the handling of surrogate pairs, and one can very easily get confused over what string lengths mean.

Richard.
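A minimal Python sketch of the surrogate-pair pitfall, added for illustration and not taken from the thread. Python strings are sequences of code points, so UTF-16 code units are produced here by encoding explicitly. The helper names are hypothetical, and `truncate_utf16` assumes well-formed input.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def truncate_utf16(s, max_units):
    """Keep at most max_units UTF-16 code units without splitting a surrogate pair."""
    units = utf16_code_units(s)[:max_units]
    if units and 0xD800 <= units[-1] <= 0xDBFF:
        units.pop()  # drop a high surrogate that lost its low surrogate
    return b"".join(u.to_bytes(2, "little") for u in units).decode("utf-16-le")


text = "a\U0001D11E"  # "a" followed by MUSICAL SYMBOL G CLEF (outside the BMP)

print(len(text))                                   # length in code points
print([hex(u) for u in utf16_code_units(text)])    # length in code units differs

# Naive truncation to two code units leaves a lone high surrogate.
naive = text.encode("utf-16-le")[:4]
try:
    naive.decode("utf-16-le")
except UnicodeDecodeError:
    print("naive truncation produced ill-formed UTF-16")

print(repr(truncate_utf16(text, 2)))
```

The string has two code points but three code units, which is the "what does length mean" confusion in miniature.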
Re: "A Programmer's Introduction to Unicode"
On 13 Mar 2017, at 17:55, J Decker wrote:
>
> I liked the Go implementation of character type - a rune type - which is a
> codepoint. and strings that return runes from by index.
> https://blog.golang.org/strings

IMO, returning code points by index is a mistake. It over-emphasises the importance of the code point, which helps to continue the notion in some developers’ minds that code points are somehow “characters”. It also leads to people unnecessarily using UCS-4 as an internal representation, which seems to have very few advantages in practice over UTF-16.

> Doesn't solve the problem for composited codepoints though...
>
> texel looks to be defined as a graphic element already. TEXture ELement.

Yes, but I thought the proposal was “textel”, with the extra “t”. Re-using “texel” would be quite inappropriate; there are certainly people who work on rendering software who would strongly object to that, for very good reasons.

I would caution, however, that there’s already a lot of terminology associated with Unicode, perhaps for understandable reasons, but if the word “textel” is going to have a definition that differs from (say) an extended grapheme cluster, I think a great deal of consideration should be given to what exactly that definition should be.

We already have “characters”, code units, code points, combining sequences, graphemes, grapheme clusters, extended grapheme clusters and probably other things I’ve missed off that list. Merely adding yet another bit of terminology isn’t going to fix the problem of developers misunderstanding or simply not being aware of the correct terminology or of some aspect of Unicode’s behaviour.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: "A Programmer's Introduction to Unicode"
On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
> On 13 Mar 2017, at 17:55, J Decker wrote:
> >
> > I liked the Go implementation of character type - a rune type - which is a
> > codepoint. and strings that return runes from by index.
> > https://blog.golang.org/strings
>
> IMO, returning code points by index is a mistake. It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

But there are many text operations that require access to Unicode code points. Take for example text layout, as mapping characters to glyphs and back has to operate on code points. The idea that you never need to work with code points is too simplistic.

Regards,
Khaled
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 23:10:11 +0200 Khaled Hosny wrote:

> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need
> to work with code points is too simplistic.

There are advantages to interpreting and operating on text as though it were in form NFD. However, there are still cases where one needs fractions of a character, such as word boundaries in Sanskrit, though I think the locations are liable to be specified in a language-specific form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it in at least 4 ways.

Richard.
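A small Python sketch of both halves of this point, added for illustration and not taken from the thread. It uses the standard `unicodedata` module; the helper `nfd` and the sample strings are choices made for this example.

```python
import unicodedata


def nfd(s):
    """Return s in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", s)


# Advantage of working in NFD: canonically equivalent spellings become identical.
precomposed = "\u0929"       # DEVANAGARI LETTER NNNA
sequence = "\u0928\u093C"    # DEVANAGARI LETTER NA + SIGN NUKTA
print(precomposed == sequence)            # raw comparison
print(nfd(precomposed) == nfd(sequence))  # comparison after NFD

# Limit of NFD: the vowel sign AA in a sandhi-joined form has no decomposition,
# so normalization cannot expose a position "inside" it.
samhita = "\u0928\u093E\u0917\u091A\u094D\u091B\u0924\u094D"  # नागच्छत्
print(nfd(samhita) == samhita)
print(unicodedata.name(samhita[1]))
```

NFD gives a uniform code point sequence to operate on, but the word boundary inside the AA is a fact about the language, not about the encoding, so it has to be represented by some other means.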