Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien
Quote/Cytat - William_J_G Overington  (Mon  
13 Mar 2017 12:24:13 PM CET):



> Prof. Janusz S. Bień wrote:
>
>> Just yet another reason for introducing the notion of textel?
>
> I opine that it would be a good idea to introduce several new words,
> of which textel would be one, with each such new word having a
> precisely-defined meaning so that in precise discussions of
> programming techniques people could discuss the situation without
> needing to use any of the words character, code point, grapheme
> cluster.
>
> How many such new words would be needed?


In my paper (in Polish)

http://bc.klf.uw.edu.pl/480/

I propose also the term "texton" meaning a code point from a specific  
subset, not yet fully defined, but including at least the components  
of composite characters.


Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien
Quote/Cytat - Richard Wordingham   
(Sun 12 Mar 2017 09:10:22 PM CET):



> On Sun, 12 Mar 2017 20:02:28 +0100
> "Janusz S. Bien"  wrote:
>
>> If the basic notion has to be referred to in a cumbersome way as
>> "extended grapheme cluster" then it is easier to talk about "Unicode
>> characters" despite the fact that they have a rather loose relation
>> to real-life/user-perceived characters.
>
> The notion that extended grapheme clusters correspond to
> user-perceived characters is also rather dodgy.

The idea is not mine, but it appears from time to time on the list in
a more or less explicit way.

> Whereas it may work
> for French, it is getting very dubious by the time one adds Hebrew
> cantillation marks or Vedic accentuation.  The Thais revolted when
> their preposed vowels were joined with the following consonant in the
> same extended grapheme cluster, and Unicode had to revoke that union.


Just yet another reason for introducing the notion of textel?

Best regards

Janusz





Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread William_J_G Overington
Prof. Janusz S. Bień wrote:

> Just yet another reason for introducing the notion of textel?

I opine that it would be a good idea to introduce several new words, of which 
textel would be one, with each such new word having a precisely-defined meaning 
so that in precise discussions of programming techniques people could discuss 
the situation without needing to use any of the words character, code point, 
grapheme cluster.

How many such new words would be needed?

I remember how in electronics the introduction of the term Hertz to be used 
instead of cycles per second helped discussions.

After the introduction of the term Hertz it became easy to refer to twenty 
cycles of a fifty Hertz signal without confusion over one's meaning.

So introducing several new precisely-defined words now could help lots of 
discussions in the future.

Perhaps, apart from textel, the definitions could be produced first and then 
people can decide, for each such definition, which new word would be a good 
word to have that definition.

The recent introduction into Unicode of ZWJ sequences for some emoji and of
tag sequences applied to a base character could make the introduction of such
new words increasingly important, because of the programming implications of
those recently introduced techniques.
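
To make those programming implications concrete, here is a small Python sketch. It is an illustration only, not something posted in the thread: the two sample strings (a man, woman, girl ZWJ sequence and the England flag tag sequence) and the helper names are assumptions chosen for the example. Each sample is normally displayed as a single emoji, yet it consists of several code points.

```python
# Illustration only: what a single displayed emoji can look like underneath.

# ZWJ sequence: MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Tag sequence: WAVING BLACK FLAG, tag letters "gbeng", CANCEL TAG.
# Tag characters mirror ASCII at an offset of U+E0000.
ENGLAND = (
    "\U0001F3F4"
    + "".join(chr(0xE0000 + ord(c)) for c in "gbeng")
    + "\U000E007F"
)


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf16_length(text):
    """Return the number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for name, sample in (("family", FAMILY), ("England flag", ENGLAND)):
        print(name, len(sample), utf16_length(sample), code_points(sample))
```

Neither "character" nor "code point" describes the unit a user sees here, which is the gap the proposed new words are meant to fill.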

William Overington

Monday 13 March 2017




Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Asmus Freytag

  
  
On 3/13/2017 3:31 AM, Janusz S. Bien wrote:

> Just yet another reason for introducing the notion of textel?

The main difference between "textel" and "pixel" is that the unit of
processing/displaying text is not uniform and fixed, unlike a pixel. In
other words, different operations may need to look at text differently,
and I don't mean the trivial case of storage (byte level) vs. any higher
level.

Correspondingly, the discussion of "text element", at least in the early
versions of the Unicode Standard, left the particular division of the
text into "text elements" unspecified.

There are closely related tasks that might demonstrate this. Assume a
script where multiple code points make up a syllable, yet that syllable
is the intuitive basic unit of reading and writing.

One task is cursor placement. For that task, you need to be able to
divide *any* text so that the cursor ideally does not get positioned in
the middle of a syllable. However, the definition of a "syllable" has to
allow degenerate and 'defective' cases. Which is which is of no
importance, as long as it is possible to find a valid cursor position.

The other task would be to assert that a string contains only
well-formed syllables. Here, it is crucially necessary to be able to
define which syllables are well-formed. Finding divisions in parts of
the string that do not contain well-formed syllables is not necessary.

You may also find that in some cases, even though the syllable is the
basic unit, there may be a need to edit it in ways other than as a unit.
Some syllables may have some optional marks, signs or symbols added that
may need to be edited or traversed explicitly, while a "core" syllable
may be more likely to be a unit.

These (or similar) scenarios indicate the impossibility to come to a
single, universal definition of a "textel" -- the main reason why this
term is of lower utility than "pixel".
A./
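
The contrast between the two tasks above can be sketched in a few lines of Python. This is a toy under loud assumptions, not an implementation of any Unicode algorithm: a "unit" here is one non-mark code point plus any following marks (general category M*), standing in for the syllables of the message, and both function names are hypothetical.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def cursor_stops(text):
    """Tolerant task: return valid cursor offsets for *any* string.

    Defective input (for example a leading mark) still gets usable
    stops; the function never rejects anything.
    """
    stops = [0]
    for i, ch in enumerate(text):
        if i > 0 and not is_mark(ch):
            stops.append(i)
    if text:
        stops.append(len(text))
    return stops


def is_well_formed(text):
    """Strict task: every mark must attach to a letter (toy rule)."""
    base_is_letter = False
    for ch in text:
        if is_mark(ch):
            if not base_is_letter:
                return False
        else:
            base_is_letter = unicodedata.category(ch).startswith("L")
    return True


if __name__ == "__main__":
    for sample in ("e\u0301a", "\u0301a", " \u0301"):
        print(repr(sample), cursor_stops(sample), is_well_formed(sample))
```

The first function has to accept everything and only needs boundaries; the second needs a precise definition of well-formedness and no boundaries at all, so the two do not share a single notion of "unit".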
  
  



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien
Quote/Cytat - Asmus Freytag  (Mon 13 Mar 2017  
06:00:08 PM CET):


[...]

> These (or similar) scenarios indicate the impossibility to come to a
> single, universal definition of a "textel" -- the main reason why this
> term is of lower utility than "pixel".

I agree that it is impossible  to come to a single, universal  
definition of text elements, but it seems possible to reach a  
consensus on a kind of the least common denominator of them and call  
it "textel" or something else.


Best regards

Janusz




Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Manish Goregaokar
Do you have examples of AA being split that way (and further reading)?
I think I'm aware of what you're talking about, but would love to read
more about it.
-Manish


On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham
 wrote:
> On Mon, 13 Mar 2017 23:10:11 +0200
> Khaled Hosny  wrote:
>
>> But there are many text operations that require access to Unicode code
>> points. Take for example text layout, as mapping characters to glyphs
>> and back has to operate on code points. The idea that you never need
>> to work with code points is too simplistic.
>
> There are advantages to interpreting and operating on text as though it
> were in form NFD.  However, there are still cases where one needs
> fractions of a character, such as word boundaries in Sanskrit, though I
> think the locations are liable to be specified in a language-specific
> form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
> in at least 4 ways.
>
> Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 15:26:00 -0700
Manish Goregaokar  wrote:

> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.

Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution'
brings up plenty of papers and discussion, e.g. Hellwig's at
http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at
https://www.aclweb.org/anthology/C/C16/C16-1048.pdf.

There are even technical terms for before and after.  Unsplit text is
'samhita text', and text split into words is 'pada text'.

Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien

Quote/Cytat - J Decker  (Mon 13 Mar 2017 06:55:18 PM CET):


> texel looks to be defined as a graphic element already.  TEXture ELement.


I'm aware of it, but homonymy/polysemy is something we have to live  
with. I think there is no risk of confusing texture elements with text  
elements, despite the fact that 'texture' and 'text' have similar  
origin.


Best regards

Janusz




Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread J Decker
I liked the Go implementation of character type - a rune type - which is a
codepoint, and strings that return runes by index.
https://blog.golang.org/strings

Doesn't solve the problem for composited codepoints though...
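
A minimal sketch of that remaining problem, written in Python rather than Go because Python's str is likewise indexed by code point. The sample strings and the helper name are assumptions made for the illustration.

```python
# Code-point indexing does not keep a base letter and its accent together.

DECOMPOSED = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
PRECOMPOSED = "caf\u00e9"   # the same text with the single code point U+00E9


def last_code_point(text):
    """Return the final code point, which need not be a whole 'character'."""
    return text[-1]


if __name__ == "__main__":
    print(len(DECOMPOSED), len(PRECOMPOSED))     # the lengths differ
    print(DECOMPOSED == PRECOMPOSED)             # plain comparison sees two strings
    print(f"U+{ord(last_code_point(DECOMPOSED)):04X}")
    print(DECOMPOSED[:4])                        # slicing drops the accent
```

Both strings normally display identically, but indexing or slicing the decomposed one by code point can separate the accent from its base letter.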

texel looks to be defined as a graphic element already.  TEXture ELement.



On Mon, Mar 13, 2017 at 10:15 AM, Janusz S. Bien 
wrote:

> Quote/Cytat - Asmus Freytag  (Mon 13 Mar 2017
> 06:00:08 PM CET):
>
> [...]
>
> These (or similar) scenarios indicate the impossibility to come to a
> single, universal definition of a "textel" -- the main reason why this
> term is of lower utility than "pixel".
>
> I agree that it is impossible  to come to a single, universal definition
> of text elements, but it seems possible to reach a consensus on a kind of
> the least common denominator of them and call it "textel" or something else.
>
>
> Best regards
>
> Janusz
>


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 20:20:25 -0400
"Mark E. Shoulson"  wrote:

> Sanskrit external vowel sandhi is comparatively 
> straightforward (compared to consonant sandhi), and it frequently
> loses information.  A *or* AA plus I is E; A *or* AA plus U is O (you
> need A + O to get AU).

Indeed, E can not only be A or AA plus I or II: it can also be E + A.
In the latter case avagraha is usual, at least in European practice.
(Would that generally be locale sa_Deva_GB?) I'd like advice on modern
Indian practice, and on the spacing and syllable division. I've seen a
claim that avagraha always belongs with the preceding vowel, but I'm
not sure that that rule applies in this case.

In a similar fashion, O can be -AS + A-, an interesting case of visarga
sandhi. However, I'm not sure that one would want to *divide* the E or
O.

Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Mark E. Shoulson
A word ending in A *or* AA preceding a word beginning in A *or* AA will 
all coalesce to a single AA in Sanskrit.  That's four possibilities, and 
that doesn't count a word ending in a consonant preceding a word 
beginning in AA, which would be written the same.  My memory is rusty, 
so I should actually be looking things up, but I think these are valid 
constructions:


न + अगच्छत्  →  नागच्छत्
न + आगच्छत्  → नागच्छत्

(and indeed, आगच्छत् is the upasarga आ plus अगच्छत्, so there too the A 
+ AA coalesced.)  I should probably find you examples for all the other 
possibilities.  Sanskrit external vowel sandhi is comparatively 
straightforward (compared to consonant sandhi), and it frequently loses 
information.  A *or* AA plus I is E; A *or* AA plus U is O (you need A + 
O to get AU).
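
The loss of information can be shown with a toy Python model. It is a sketch under stated assumptions, not a sandhi engine: it knows only the rule (A or AA) + (A or AA) gives AA, it does not check that the first word really ends in a consonant with inherent A, and the names join_a_vowels and candidate_splits are hypothetical. The strings are the two examples above, spelled with escapes.

```python
# Toy model only: it knows a single rule, (A or AA) + (A or AA) -> AA.
SIGN_AA = "\u093e"    # DEVANAGARI VOWEL SIGN AA
LETTER_A = "\u0905"   # DEVANAGARI LETTER A
LETTER_AA = "\u0906"  # DEVANAGARI LETTER AA

NA = "\u0928"                                   # first word, ends in inherent A
REST = "\u0917\u091a\u094d\u091b\u0924\u094d"   # second word minus its initial vowel
AGACCHAT = LETTER_A + REST
AAGACCHAT = LETTER_AA + REST


def join_a_vowels(first, second):
    """Join two words whose boundary vowels are A or AA.

    Assumes `first` ends in a consonant letter with inherent A or in
    VOWEL SIGN AA, and that `second` starts with LETTER A or LETTER AA.
    """
    if not second or second[0] not in (LETTER_A, LETTER_AA):
        raise ValueError("second word must start with A or AA")
    stem = first[:-1] if first.endswith(SIGN_AA) else first
    return stem + SIGN_AA + second[1:]


def candidate_splits(joined, index):
    """List the four A/AA + A/AA splits of the VOWEL SIGN AA at `index`."""
    if joined[index] != SIGN_AA:
        raise ValueError("no VOWEL SIGN AA at that index")
    return [
        (joined[:index] + ending, start + joined[index + 1:])
        for ending in ("", SIGN_AA)
        for start in (LETTER_A, LETTER_AA)
    ]


if __name__ == "__main__":
    joined = join_a_vowels(NA, AGACCHAT)
    print(joined == join_a_vowels(NA, AAGACCHAT))  # same text from both inputs
    for first, second in candidate_splits(joined, 1):
        print(first, "+", second)
```

Every candidate split rejoins to the same written form, so the text alone cannot say which of the four was meant; choosing one needs knowledge of the language.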


~mark


On 03/13/2017 06:26 PM, Manish Goregaokar wrote:

> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.
> -Manish
>
>
> On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham
>  wrote:
>> On Mon, 13 Mar 2017 23:10:11 +0200
>> Khaled Hosny  wrote:
>>
>>> But there are many text operations that require access to Unicode code
>>> points. Take for example text layout, as mapping characters to glyphs
>>> and back has to operate on code points. The idea that you never need
>>> to work with code points is too simplistic.
>>
>> There are advantages to interpreting and operating on text as though it
>> were in form NFD.  However, there are still cases where one needs
>> fractions of a character, such as word boundaries in Sanskrit, though I
>> think the locations are liable to be specified in a language-specific
>> form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
>> in at least 4 ways.
>>
>> Richard.





Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 19:18:00 +
Alastair Houghton  wrote:

> IMO, returning code points by index is a mistake.  It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

The problem is that UTF-16 based code can very easily overlook the
handling of surrogate pairs, and one can very easily get confused over
what string lengths mean.

Richard.
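
A short Python sketch of the length confusion, using U+1D11E MUSICAL SYMBOL G CLEF as an assumed sample outside the Basic Multilingual Plane. The helper names are made up for this example.

```python
def lengths(text):
    """Return the three 'lengths' that most often get confused."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


def utf16_units(text):
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a UTF-16 code unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


if __name__ == "__main__":
    sample = "a\U0001D11E"  # 'a' followed by MUSICAL SYMBOL G CLEF
    print(lengths(sample))
    print([hex(u) for u in utf16_units(sample)])
    print([is_surrogate(u) for u in utf16_units(sample)])
```

Code that treats each UTF-16 code unit as a character would see three "characters" in that sample, two of which are surrogates and meaningless on their own.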



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Alastair Houghton
On 13 Mar 2017, at 17:55, J Decker  wrote:
> 
> I liked the Go implementation of character type - a rune type - which is a
> codepoint, and strings that return runes by index.
> https://blog.golang.org/strings

IMO, returning code points by index is a mistake.  It over-emphasises the 
importance of the code point, which helps to continue the notion in some 
developers’ minds that code points are somehow “characters”.  It also leads to 
people unnecessarily using UCS-4 as an internal representation, which seems to 
have very few advantages in practice over UTF-16.

> Doesn't solve the problem for composited codepoints though... 
> 
> texel looks to be defined as a graphic element already.  TEXture ELement.

Yes, but I thought the proposal was “textel”, with the extra “t”.  Re-using 
“texel” would be quite inappropriate; there are certainly people who work on 
rendering software who would strongly object to that, for very good reasons.

I would caution, however, that there’s already a lot of terminology associated 
with Unicode, perhaps for understandable reasons, but if the word “textel” is 
going to have a definition that differs from (say) an extended grapheme 
cluster, I think a great deal of consideration should be given to what exactly 
that definition should be.  We already have “characters”, code units, code 
points, combining sequences, graphemes, grapheme clusters, extended grapheme 
clusters and probably other things I’ve missed off that list.  Merely adding 
yet another bit of terminology isn’t going to fix the problem of developers 
misunderstanding or simply not being aware of the correct terminology or of 
some aspect of Unicode’s behaviour.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Khaled Hosny
On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
> On 13 Mar 2017, at 17:55, J Decker  wrote:
> > 
> > I liked the Go implementation of character type - a rune type - which is a
> > codepoint, and strings that return runes by index.
> > https://blog.golang.org/strings
> 
> IMO, returning code points by index is a mistake.  It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

But there are many text operations that require access to Unicode code
points. Take for example text layout, as mapping characters to glyphs
and back has to operate on code points. The idea that you never need to
work with code points is too simplistic.

Regards,
Khaled


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 23:10:11 +0200
Khaled Hosny  wrote:
 
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need
> to work with code points is too simplistic.

There are advantages to interpreting and operating on text as though it
were in form NFD.  However, there are still cases where one needs
fractions of a character, such as word boundaries in Sanskrit, though I
think the locations are liable to be specified in a language-specific
form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
in at least 4 ways.

Richard.
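
One advantage of working as though text were in NFD can be sketched with Python's standard unicodedata module: canonically equivalent strings become identical, and combining marks end up in a canonical order. The function name canonically_equal is an assumption made for the example.

```python
import unicodedata


def nfd(text):
    """Return text in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


def canonically_equal(a, b):
    """Compare two strings as though both were in form NFD."""
    return nfd(a) == nfd(b)


if __name__ == "__main__":
    # U+212B ANGSTROM SIGN and U+00C5 both decompose to 'A' + U+030A.
    print(canonically_equal("\u212b", "\u00c5"))
    # Marks typed in a different order are put into canonical order.
    print(canonically_equal("s\u0307\u0323", "s\u0323\u0307"))
    print([f"U+{ord(ch):04X}" for ch in nfd("\u1e69")])
```

Normalization does not help with the Sanskrit case, though: the word boundary falls inside a single code point that has no decomposition.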