Re: 0027, 02BC, 2019, or a new character?

2018-01-22 Thread James Kass via Unicode
For me, having to go around justifying my whims would probably take
some of the fun out of being an authoritarian ruler.

Which suggests that the apostrophe decision can be revised with no
explanation expected, even though a simple explanation exists.
Changing from the apostrophe to the combining acute accent above is,
after all, essentially turning the apostrophe at a slight angle and
writing it above the letter it modifies.  This would not represent a
reversed decision, simply a change in the style in which the already
selected modifier is written.

> s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e
> śíśa, śyǵys, śańǵyśy, śeśúśi, śííe



Re: 0027, 02BC, 2019, or a new character?

2018-01-22 Thread James Kass via Unicode
Martin J. Dürst wrote,

> ... One way to avoid confusion is to use one specific
> letter only as the second letter in digraphs. With the current orthography,
> they don't use w and x, so they could use one of these. But personally, I'd
> find accents more visually pleasing.

Me too:

(bottle, east, skier, crucial, cherry)
s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e
sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe
s̈ïs̈a, s̈yg̈ys, s̈an̈g̈ys̈y, s̈es̈üs̈i, s̈ïïe
śíśa, śyǵys, śańǵyśy, śeśúśi, śííe
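
The spellings above differ only in how the modifier is written, so each can be derived mechanically from the apostrophe form. A minimal Python sketch, assuming the apostrophe always directly follows the letter it modifies; the function name `restyle` and the style labels are illustrative and not part of any proposal:

```python
import unicodedata

# What replaces the apostrophe in each style (labels are illustrative).
STYLES = {
    "apostrophe": "'",
    "x-digraph": "x",
    "diaeresis": "\u0308",  # COMBINING DIAERESIS
    "acute": "\u0301",      # COMBINING ACUTE ACCENT
}

def restyle(word: str, style: str) -> str:
    """Rewrite an apostrophe-style spelling in another modifier style."""
    replaced = word.replace("'", STYLES[style])
    # NFC composes base + combining mark where a precomposed letter exists.
    return unicodedata.normalize("NFC", replaced)

for style in STYLES:
    print(style, restyle("s'i's'a", style))
```

Letters without a precomposed form (s, g and n with diaeresis, for instance) stay as base letter plus combining mark after normalization.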



Re: superscripts & subscripts for science/mathematics?

2018-01-22 Thread David Melik via Unicode

On 01/21/2018 02:27 PM, Frédéric Grosshans wrote:

> On 21/01/2018 at 07:15, David Melik via Unicode wrote:
>> I don't know if this was discussed, but it'd help
>> scientists/mathematicians if all Greek and Hebrew were available as
>> superscript & subscript.  Mathematicians use certain such letters in
>> standard notation of important expressions/formulae (superscript π in
>> Euler's Identity, subscript base π, superscript א in cardinality of
>> real numbers, etc.)... actually we use all Greek letters, and since a
>> few Hebrew (since 1800s) have standard mathematical meanings, more
>> are used for variables.  After any such alphabets' letters are used,
>> the rest are considered normal/standard to use in standard script,
>> superscript, and subscript, for any educational usage, and future
>> standard notation.
>
> Mathematics superscript and subscript are supposed to be rich text,
> not plain text. Furthermore, “completing the set of mathematical
> superscripts” is an impossible task, since one would need double
> superscripts for e^(-x²) and even more exotic combinations for stuff
> as common as e^x₁

On 01/22/2018 01:20 PM, Murray Sargent wrote:

> Subscripts and superscripts are more complicated in mathematics than
> in ordinary text in that they can be nested and can include arbitrary
> operators, e.g., a superscripted superscript as in e^(-x^2).
> Accordingly, encoding more Unicode subscripts and superscripts for
> mathematics isn't general enough to be worthwhile and it can
> complicate math input methods. In plain text, one can use a linear
> format such as LaTeX or UnicodeMath. Ideally these formats can drive
> math display engines that display elegant mathematical typography with
> arbitrary combinations of subscripts, superscripts and other
> mathematical constructs.


‘The intended use was to allow chemical and algebra formulas to be 
written without 
markup’--https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts. 
 Unless wrong, apart from  disagreement, it's clear mathematics word 
processing software is useful, but not a reason to not finish 
almost-complete set of basic superscripts & subscripts 
((super|sub)scripts) for relevant alphabets used (English, Greek, 
perhaps Hebrew, latter two which were in my original post subject line, 
but I likely accidentally used link I received to delete pre-moderated 
post.)  Before rich text, people used plain text for centuries, and still do, 
such as plain-text files that may be about simpler topics, or 
informal/notes, and Internet areas predating all websites, such as 
standard (such as this non-HTML) email, NNTP/Usenet (still hundreds of 
mathematics posts/day,) Internet Relay Chat (IRC, still dozens of 
science & mathematics rooms, one math one with around 1000 people, busy 
all the time) etc., but the latter at least has Unicode (none are 
rich-text.)
This shows how much of English has superscripts, and which letter 
doesn't: ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖqʳˢᵗᵘᵛʷˣʸᶻ.  However, there are simple 
mathematics situations people use any/every letter, lowercase & 
uppercase, superscript & subscript (not sure about ‘overscript.’)
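
The coverage gap can be checked mechanically. A minimal Python sketch that uses exactly the superscript list quoted above; the names `SUPERSCRIPTS`, `TO_SUPER` and `to_superscript` are illustrative, and the table reflects the list as given in this message rather than any later version of Unicode:

```python
import string

# Superscript forms as listed in the message; "q" maps to itself because
# the list gives no superscript for it.
SUPERSCRIPTS = "ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖqʳˢᵗᵘᵛʷˣʸᶻ"
TO_SUPER = dict(zip(string.ascii_lowercase, SUPERSCRIPTS))

def to_superscript(text: str) -> str:
    """Map lowercase ASCII letters to superscripts; others pass through."""
    return "".join(TO_SUPER.get(ch, ch) for ch in text)

# Letters whose "superscript" is just the letter itself are the gaps.
missing = [ch for ch, sup in TO_SUPER.items() if ch == sup]
print(missing)                     # ['q']
print("e" + to_superscript("ix"))  # eⁱˣ
```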
It's up to each science/math fan, student, writer, instructor what 
type of text they want (not just what you say is ‘supposed to be,’ ‘can 
complicate.’)  I never said
make Unicode like super-complicated stuff math formatting software... 
only a small percentage of where people write math, which of course, 
writing isn't just advanced books, but also simple & informal/notes, and 
plain-text isn't just in text-editors, but also graphics editors.
If not clearer now, all I was requesting was adding/completing 
Greek (super|sub)scripts, though had forgotten not all English ones 
exist, so those too, and I was suggesting Hebrew (super|sub)scripts... 
never mentioned supersuperscripts & subsubscripts, etc., which one of 
you showed then argued against (doesn't refute what I actually said.) 
I'm just talking about completing relevant alphabets for usage described 
‘chemical and algebra formulas,’ which as I took algebra before high 
school, wasn't seeing super-complicated stuff that may or not be in 
college/university algebra texts, or are in derived fields with some 
algebra-type formulae.  I'm only talking about simple, one-level 
(super|sub)scripts for largest variety of simple formulae, not 
‘completing the set’ (in relation to all math) nor 
(super|sub)(super|sub)scripts as in replies with mixed style.
The biggest problem for me is Euler's Formula & Identity, which 
through high school math of analysis/calculus (and on through several 
years to applied & abstract analysis) are usually considered the most 
important & beautiful formula & identity in mathematics (the formula 
modelling basis of all current physics, and the identity having the most 
important numbers, symbols/operations in math.)  It's easy to write his 
formula plain-text, as below.


eⁱˣ=cos x+i sin x

Almost every day in my plain-text notes/to-do-list, I read these, and 

Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2018-01-22 Thread Mark Davis ☕️ via Unicode
Good point, thanks

Mark

On Mon, Jan 22, 2018 at 6:41 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sun, 21 Jan 2018 22:34:12 -0800
> Mark Davis ☕️ via Unicode  wrote:
>
> > The ZWJ Virama sequence is already provided for by the combination of
> > GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would
> > mean the addition of something like:
> >
> > GB9d: × (ZWNJ ViramaExtend* Virama)
>
> I don't think we need ViramaExtend* here.  The sequence should be
> followed by a base consonant, so there's no way for another mark to
> sneak in.
>
> Incidentally, I think ViramaExtend would be better named as NSExtend,
> with 'NS' for 'non-starter'.
>
> Richard.
>
>


Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2018-01-22 Thread Richard Wordingham via Unicode
On Sun, 21 Jan 2018 22:34:12 -0800
Mark Davis ☕️ via Unicode  wrote:

> I was looking at the feedback in http://www.unicode.org/review/pri355/,
> and didn't see yours there. Could you please file your feedback
> there? (Nothing on this list is tracked by the committee...)

This is the submission I have just made:

The major principled issue I have is that UAX#29 can no longer claim to
have a sound definition of the concept of a 'user-perceived character'.
Perhaps it never did.

Some of the claims would be better if there were
evidence to back them up.  For example, this evening I did a quick bit
of research and asked the Korean owner of the local Korean restaurant
how many letters there were in the hangul spelling of 'Gangnam'.  She
traced out the spelling of the word (강남) and came back with the
answer '6'. UAX#29 claims it has 2 user-perceived characters.  You
might also argue that she has spent too long in England to be a useful
informant.
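
The two answers correspond to two ways of counting the same string. A minimal Python sketch, assuming that the number of conjoining jamo after canonical decomposition (NFD) is a fair stand-in for "letters"; the function name `count_units` is illustrative:

```python
import unicodedata

def count_units(word: str) -> tuple:
    """Return (precomposed syllables, jamo after canonical decomposition)."""
    syllables = len(word)
    jamo = len(unicodedata.normalize("NFD", word))
    return syllables, jamo

gangnam = "\uac15\ub0a8"  # the hangul spelling of 'Gangnam'
print(count_units(gangnam))  # (2, 6)
```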

The following old paragraph causes grief for me:

"As far as a user is concerned, the underlying representation of text
is not important, but it is important that an editing interface present
a uniform implementation of what the user thinks of as characters.
Grapheme clusters commonly behave as units in terms of mouse selection,
arrow key movement, backspacing, and so on. For example, when a grapheme
cluster is represented internally by a character sequence consisting of
base character + accents, then using the right arrow key would skip
from the start of the base character to the end of the last accent."

The problem is that many editors read this as saying that the arrow
keys should move by whole characters.  The result of this is that in
many applications, to replace the first character of a grapheme cluster
one must retype the entire grapheme cluster.  With a grapheme cluster
of three characters, as is common in Thai and Korean, this is
irritating.  With a grapheme cluster of four or five characters, as is
common in Northern Thai, it is annoying.

The prospect of the grapheme cluster being extended to include a whole
akshara fills me with dismay.  Consider the Northern Thai word ᩉ᩠ᨾᩰᩬᩫᩡ
 /mɔʔ/ 'scrumptious'.  At present,
this 7 character word is split into three grapheme clusters, of lengths
2, 4 and 1.  However, it is clearly a single akshara.  To change the
first character, I would have to also retype the other 6 characters.
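
The 2, 4, 1 split can be reproduced with a very rough approximation of the segmentation rules. This Python sketch is not an implementation of UAX#29: it attaches every mark to the preceding character and hard-codes three Tai Tham spacing marks that start their own cluster. The code points are my reading of the word as quoted, and the names `rough_clusters` and `STARTS_OWN_CLUSTER` are hypothetical:

```python
import unicodedata
from typing import List

# Tai Tham spacing marks assumed to start a cluster of their own
# (an assumption of this sketch, not a complete exception list).
STARTS_OWN_CLUSTER = {"\u1a61", "\u1a63", "\u1a64"}

def rough_clusters(text: str) -> List[str]:
    """Very rough grapheme clusters: a base plus its following marks."""
    clusters: List[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me", "Mc")
        if clusters and is_mark and ch not in STARTS_OWN_CLUSTER:
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# HIGH HA, SAKOT, MA, OO, OA BELOW, O, A (assumed spelling of the word)
word = "\u1a49\u1a60\u1a3e\u1a70\u1a6c\u1a6b\u1a61"
print([len(c) for c in rough_clusters(word)])
```

An editor that deletes by cluster removes two code points when the first cluster is deleted; one that treated the whole akshara as a cluster would remove all seven.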

My first thought was that changing software that way would breach the
UK's Equality Act 2010, by further restricting the ability of Northern
Thai users to do character by character editing.  (My wife's
protected characteristic extends to me for the purposes of the
Act.)  However, there may be a get-out in the form of Schedule 3 Section
30
(https://www.legislation.gov.uk/ukpga/2010/15/schedule/3/paragraph/30).
The supplier of the service can claim that they only supply a character
by character editing facility to the ethnic groups using simple scripts,
and that they are under no obligation to supply the service to members
of other ethnic groups. - 
"If a service is generally provided only for persons who share a
protected characteristic, a person (A) who normally provides the
service for persons who share that characteristic does not contravene
section 29(1) or (2)—

(a)by insisting on providing the service in the way A normally provides
it, or

(b)if A reasonably thinks it is impracticable to provide the service to
persons who do not share that characteristic, by refusing to provide
the service."

But what an embarrassing defence to offer!

However, there is another reason for rejecting the extension of
grapheme clusters to whole aksharas.  Currently, U+1A63 TAI THAM
VOWEL SIGN AA starts a grapheme cluster.  However, for non-defective
text, it is part of the same akshara as the preceding grapheme
cluster.  Now, the decision to make U+1A63 start a new grapheme cluster
is intrinsically reasonable.  It can have its own stack with a subscript
consonant and even a vowel, and it is not difficult to find manuscripts
showing a line break before it, e.g. L2/07-007 Figure 9b Leaf 2 lines
2/3, ᩈᨾᩮᩣᨴ᩠ᨴᨾ-ᩣᨶᩮᩉᩥ.

I believe that the akshara should be a level of text above the grapheme
cluster.  Ideally, it would be below the level of a word, but of course
in Sanskrit, word boundaries readily occur within present day grapheme
clusters.  (I made this recommendation in L2/17-122.)

Further comments apply to the definition of akshara boundaries,
regardless of whether they are to coincide with the boundaries of
grapheme clusters.

These rules do not work well where virama may fall back to visible
virama.  This is particularly the case with Tamil, where conjuncts are
restricted to K.SSA and SH.RII.  Johny Cibu provided an example where
the title துக்ளக் is broken as [ta-u,
ka-virama, lla, ka-virama]. However, as per the proposed algorithm 

Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2018-01-22 Thread Richard Wordingham via Unicode
On Sun, 21 Jan 2018 22:34:12 -0800
Mark Davis ☕️ via Unicode  wrote:

> The ZWJ Virama sequence is already provided for by the combination of
> GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would
> mean the addition of something like:
> 
> GB9d: × (ZWNJ ViramaExtend* Virama)

I don't think we need ViramaExtend* here.  The sequence should be
followed by a base consonant, so there's no way for another mark to
sneak in.

Incidentally, I think ViramaExtend would be better named as NSExtend,
with 'NS' for 'non-starter'.

Richard.



Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2018-01-22 Thread Richard Wordingham via Unicode
On Sun, 21 Jan 2018 22:34:12 -0800
Mark Davis ☕️ via Unicode  wrote:
 
> FYI, I'm thinking now that the change should be:
> 
> GB9c: (Virama | ZWJ )   × LinkingConsonant
> =>  
> GB9c: (Virama ViramaExtend* | ZWJ ) × LinkingConsonant
> 
> where ViramaExtend = [Extend - Virama - \p{ccc=0}]
> (This is pre-partitioning.)
> 
> That is close to your formulation, but for canonical equivalence,
> there shouldn't be a need to allow the ViramaExtend after ZWJ, because the
> ZWJ has ccc=0, and thus nothing reorders around it.

These look fine.
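
The ViramaExtend class quoted above can be approximated with the canonical combining class alone. This Python sketch rests on two assumptions: that every character with a non-zero combining class is an Extend character, and that viramas are exactly the characters with ccc=9. Python's `unicodedata` module does not expose Grapheme_Cluster_Break, so this is not the rule as it would be specified; `is_virama_extend` is a hypothetical name:

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class shared by most viramas (assumption)

def is_virama_extend(ch: str) -> bool:
    """Approximate [Extend - Virama - \\p{ccc=0}] via combining class only."""
    ccc = unicodedata.combining(ch)
    return ccc != 0 and ccc != VIRAMA_CCC

for ch in ("\u093c", "\u094d", "\u200d", "\u0915"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_virama_extend(ch))
```

Under this approximation ZWJ (ccc=0) is excluded, which matches the remark that nothing reorders around it.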

> Cibu also pointed out on a different thread that for Malayalam we
> need to consider a couple of other forms:
> 
> ... Following contexts should be allowed for requesting reformed or
> traditional conjuncts as per Unicode10.0.0/ch12 page 505.  ...
> 
> /$L ZWNJ $V $L/
> /$L ZWJ $V $L/
> 
> The ZWJ Virama sequence is already provided for by the combination of
> GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would
> mean the addition of something like:
> 
> GB9d: × (ZWNJ ViramaExtend* Virama)

This is OK by me for aksharas.  It might make sense for Tai Tham as
well, where various degrees of binding are attested in what you can
think of as D.DH (as in 'buddha').  If the font formally ligates them
but does not always ligate subscript 'DHA' (i.e. U+1A35 TAI THAM LETTER
LOW THA),  would provide the unligated
form.  Note that in Tai Tham, SAKOT primarily affects the C2 consonant.

> 
> Cibu also wrote:
> 
> 
> Also, when we disallow /$L $V ZWJ $D/, it is disallowing the sequences
> involving legacy chillus. That is, for example,  E> is a valid sequence (Examples in Unicode10.0.0/ch12 Table 12.36).
> E> It's legacy
> equivalent would be . It might be OK to
> disallow this; but, we should be mindful of this side effect.

I see no problem here.  By GB9, we get 

NA × VIRAMA × ZWJ SIGN_E

By GB9a, we then get

 NA × VIRAMA × ZWJ × SIGN_E

Have I missed something?

Do you want me to try to formally submit my comments from this post?  I
will be going to bed as soon as I've finished extracting comments from
this thread.

Richard.



Re: 0027, 02BC, 2019, or a new character?

2018-01-22 Thread Martin J. Dürst via Unicode

On 2018/01/23 09:55, James Kass via Unicode wrote:


> Any Kazakh/Qazaq student ambitious enough to study a foreign language
> such as English is already sophisticated enough to easily distinguish
> differing digraph values between the two languages.  English speakers
> face distinctions such as the difference between the "ch" in "chigger"
> versus "chiffon" daily without any apparent danger of confusion.


Well, there are many many easier orthographies than English, so I'd 
understand if the Kazakh don't want to take English as an example.



> With so much push-back, along with technical objections, hopefully the
> government will reconsider the apostrophe situation and go with
> digraphs or diacritics.


I very much hope so too. One way to avoid confusion is to use one 
specific letter only as the second letter in digraphs. With the current 
orthography, they don't use w and x, so they could use one of these. But 
personally, I'd find accents more visually pleasing.


Regards,   Martin.


Re: 0027, 02BC, 2019, or a new character?

2018-01-22 Thread James Kass via Unicode
Phake Nick wrote,

> ... and it is not possible for e.g. a regular American
> user using Windows to simply type them out, at least not
> without prior knowledge about these umlauts.

Regular American users simply don't type umlauts, period.  Eccentric
American users needing umlauts, such as foreign language students or
heavy metal enthusiasts, generally find an easy way.  Practically
everybody knows how to search the web.

Earlier in this thread, Shriramana Sharma wrote,

> Rejecting the digraph method (which is probably the
> simplest) doesn't have much meaning because they have
> different sounds in different languages all the time
> like ch in English and German.

Any Kazakh/Qazaq student ambitious enough to study a foreign language
such as English is already sophisticated enough to easily distinguish
differing digraph values between the two languages.  English speakers
face distinctions such as the difference between the "ch" in "chigger"
versus "chiffon" daily without any apparent danger of confusion.  With
so much push-back, along with technical objections, hopefully the
government will reconsider the apostrophe situation and go with
digraphs or diacritics.


Re: 0027, 02BC, 2019, or a new character?

2018-01-22 Thread Richard Wordingham via Unicode
On Mon, 22 Jan 2018 11:35:16 +0800
Phake Nick via Unicode  wrote:

> There
> are language-dependent keyboards for French or German with special
> keys or deadkeys that help input these umlauts, but they are language
> dependent and it is not possible for e.g. a regular American user
> using Windows to simply type them out, at least not without prior
> knowledge about these umlauts.

I found the Windows 'US International' keyboard layout highly intuitive
for accented Latin-1 characters.

Richard.


Re: Internationalised Computer Science Exercises

2018-01-22 Thread Richard Wordingham via Unicode
On Mon, 22 Jan 2018 18:55:16 +0100
Frédéric Grosshans via Unicode  wrote:

> A simple challenge is to write a function which localizes numbers in a 
> script having decimal digits or parse them (i.e. which have
> characters with property Numeric_Type=Decimal, as explained in §4.6
> of the Unicode 10 standard). The list of these scripts is specified
> in table 22-3. There is usually at most one set of digits/script (with
> the exception of Arabic, Myanmar and Tai Tham).

Presumably you specify the task by defining the digit for zero.  Would
you expect them to successfully parse '10︀2' (with diagonal in middle
digit) as opposed to '102'?  Do you expect them to get the New Tai Lue
form for the number '1' correct - it's U+19DA rather than U+19D1!

Richard.



Re: Internationalised Computer Science Exercises

2018-01-22 Thread Richard Wordingham via Unicode
On Mon, 22 Jan 2018 16:39:57 +
Andre Schappo via Unicode  wrote:


> By way of example, one programming challenge I set to students a
> couple of weeks ago involves diacritics. Please see
> jsfiddle.net/coas/wda45gLp

Did any of them come up with the idea of using traces instead of
strings?

Richard.


Re: Internationalised Computer Science Exercises

2018-01-22 Thread Frédéric Grosshans via Unicode

On 22/01/2018 at 17:39, Andre Schappo via Unicode wrote:

> By way of example, one programming challenge I set to students a
> couple of weeks ago involves diacritics. Please see
> jsfiddle.net/coas/wda45gLp
>
> There is huge potential for some really interesting and challenging
> Unicode exercises. If you have any suggestions for such exercises they
> would be most welcome. Email me direct or share on this list.


A simple challenge is to write a function which localizes numbers in a 
script having decimal digits or parse them (i.e. which have characters 
with property Numeric_Type=Decimal, as explained in §4.6 of the Unicode 
10 standard). The list of these scripts is specified in table 22-3. 
There is usually at most one set of digits/script (with the exception of 
Arabic, Myanmar and Tai Tham).
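
One possible shape for such an exercise, as a Python sketch. It assumes the task is specified by the script's zero digit and that the ten decimal digits of a set occupy consecutive code points starting at that zero; the names `localize_digits` and `parse_digits` are illustrative:

```python
import unicodedata

def localize_digits(number: int, zero: str) -> str:
    """Write a non-negative integer with the digit set that starts at `zero`."""
    if number < 0 or unicodedata.decimal(zero, None) != 0:
        raise ValueError("need a non-negative number and a decimal zero digit")
    # Assumes digits 0-9 of the set are contiguous, starting at `zero`.
    return "".join(chr(ord(zero) + int(d)) for d in str(number))

def parse_digits(text: str) -> int:
    """Parse a string of decimal digits from any script."""
    # unicodedata.decimal raises ValueError for non-decimal characters.
    return int("".join(str(unicodedata.decimal(ch)) for ch in text))

devanagari = localize_digits(2018, "\u0966")  # DEVANAGARI DIGIT ZERO
print(devanagari, parse_digits(devanagari))
```

The sketch deliberately ignores signs, separators and mixed-script input, which would make natural follow-up exercises.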


Then, of course, one can look at other numeral systems (CJK, Ethiopic, 
Roman, to name a few in contemporaneous use). The section 22.3 of the 
Unicode standard is an interesting starting point for these.



An internationalised exercise which doesn’t (always) use Unicode is the 
localization of separators in numbers: 2¹⁰+π = 1,027.14 in US and 1 
027,14 in France. One also should not forget that half a million is 
5,00,000 in India. These simple things can be very surprising the first 
time you meet them.
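
The three conventions above can be sketched in Python with one grouping routine and a small table. The table entries, the names `group`, `CONVENTIONS` and `format_number`, and the use of a plain space for the French group separator (as written in this message) are assumptions of the sketch; only non-negative values are handled:

```python
import math

def group(digits: str, first: int, rest: int, sep: str) -> str:
    """Group digits from the right: one group of `first`, then groups of `rest`."""
    head, tail = digits[:-first], digits[-first:]
    groups = [tail]
    while head:
        groups.insert(0, head[-rest:])
        head = head[:-rest]
    return sep.join(groups)

# (first group size, later group size, group separator, decimal mark)
CONVENTIONS = {
    "US": (3, 3, ",", "."),
    "FR": (3, 3, " ", ","),
    "IN": (3, 2, ",", "."),
}

def format_number(value: float, decimals: int, convention: str) -> str:
    first, rest, sep, mark = CONVENTIONS[convention]
    whole, _, frac = f"{value:.{decimals}f}".partition(".")
    grouped = group(whole, first, rest, sep)
    return grouped + (mark + frac if frac else "")

print(format_number(2**10 + math.pi, 2, "US"))
print(format_number(2**10 + math.pi, 2, "FR"))
print(format_number(500000, 0, "IN"))
```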


  Frédéric


Internationalised Computer Science Exercises

2018-01-22 Thread Andre Schappo via Unicode

I continue my endeavours to get Unicode and Internationalisation into/onto (I 
am not sure which is correct) University and School Curricula. Here is another 
of my endeavours.

Yesterday, I drafted a final year student project specification for the 
2018/2019 academic year. These projects will start in October but students will 
be choosing their project some time around June. The project involves producing 
a set of internationalised Computer Science exercises for both educators and 
students. Details at 
schappo.blogspot.co.uk/2018/01/computer-science-internationalization_21.html

I am confident that more than one student will choose this project.

By way of example, one programming challenge I set to students a couple of 
weeks ago involves diacritics. Please see 
jsfiddle.net/coas/wda45gLp

There is huge potential for some really interesting and challenging Unicode 
exercises. If you have any suggestions for such exercises they would be most 
welcome. Email me direct or share on this list.

TIA

André Schappo



SignWriting in U+40000 block

2018-01-22 Thread Doug Ewell via Unicode
The IETF is noting the progress of an updated draft:

Formal SignWriting
draft-slevinski-formal-signwriting-04
https://tools.ietf.org/html/draft-slevinski-formal-signwriting-04.html

which continues to describe an implementation of SignWriting in the
as-yet unassigned Plane 4, including a detailed breakdown of blocks for
different types of characters.

I know the struggle between Slevinski and Unicode is long and
contentious, with Slevinski arguing for years that the Unicode encoding
of SignWriting is useless because it doesn't encode position, and vowing
that no implementation (under his aegis) will ever use it.

Nevertheless, I wonder if it would be appropriate for Unicode or WG2, in
some capacity, to protest in some formal way against this recommendation
to arrogate an unassigned plane instead of using the PUA, which is the
correct place for unassigned characters.
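
The distinction between the two kinds of code point is visible in the character database. A minimal Python sketch using the General_Category values Co (private use) and Cn (unassigned); the function name `classify` is illustrative, and the result for any code point depends on the Unicode version bundled with the Python interpreter:

```python
import unicodedata

def classify(code_point: int) -> str:
    """Classify a code point as private use, unassigned, or assigned."""
    category = unicodedata.category(chr(code_point))
    if category == "Co":
        return "private use"
    if category == "Cn":
        return "unassigned"
    return "assigned"

for cp in (0x40000, 0xE000, 0xF0000, 0x0041):
    print(f"U+{cp:04X}", classify(cp))
```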
 
--
Doug Ewell | Thornton, CO, US | ewellic.org