Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 09:36:44 -0400
John W Kennedy  wrote:

> Just a reminder that in Apple’s Swift a “Character” is anything that
> looks like a character, including a letter with any theoretically
> unlimited stack of diacritics, a flag, or a skin-toned emoji, and all
> Swift functions working with characters, strings, and substrings
> count characters in this way. There is an underlying store that is,
> for historic reasons, UTF-16, and that can be accessed, but so can
> UTF-8 and UTF-32.

Can the individual Unicode characters be accessed one by one, e.g. for
searching for vowels or other such 'diacritics'?  Or would one only
have access to the code units?

Could one easily search for a subjoined consonant, e.g. COENG RO in
Khmer, where the two constituent characters would be in adjacent
extended grapheme clusters?
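
To make the question concrete, here is a minimal sketch in Python rather than Swift. A Python str is indexed by code point, so a search for COENG followed by RO is unaffected by where grapheme cluster boundaries fall. The sample word and the helper name find_all are assumptions made for this illustration, not something taken from Swift or from the thread.

```python
# Sketch: search for the subjoined consonant COENG RO at the code point level.
COENG = "\u17D2"  # KHMER SIGN COENG
RO = "\u179A"     # KHMER LETTER RO


def find_all(text, needle):
    """Return the code point offsets of every occurrence of needle in text."""
    hits = []
    pos = text.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = text.find(needle, pos + 1)
    return hits


# Hypothetical sample word: KA, COENG, RO, VOWEL SIGN U, NGO
sample = "\u1780\u17D2\u179A\u17BB\u1784"
print(find_all(sample, COENG + RO))
```

The offsets returned are code point offsets, not grapheme cluster or code unit offsets.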

Richard.




Re: Unicode education in Schools

2017-08-26 Thread Eli Zaretskii via Unicode
> Date: Sat, 26 Aug 2017 22:07:57 +0100
> From: Richard Wordingham via Unicode 
> 
> > We are miscommunicating.  My point was that programming for MS-Windows
> > needs a good understanding of what the UTF-16 surrogates are, and in
> > what MS-Windows APIs/library functions they can and cannot be used.
> > Without this understanding, one cannot figure out why the likes of
> > iswspace and iswupper only support the BMP, and what APIs to use to
> > lift this limitation.  Likewise with display-related APIs, used to
> > display Unicode text.
> 
> > If you don't teach UTF-16 including these details, the programmers
> > will feel lost when they meet with these complications.
> 
> So what's new compared to UTF-8?

Who said this is new?  I said this needs to be _taught_, or else
people will be ignorant about these subtleties.


Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Sat, 26 Aug 2017 21:20:45 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Sat, 26 Aug 2017 18:52:03 +0100
> > From: Richard Wordingham via Unicode 

> We are miscommunicating.  My point was that programming for MS-Windows
> needs a good understanding of what the UTF-16 surrogates are, and in
> what MS-Windows APIs/library functions they can and cannot be used.
> Without this understanding, one cannot figure out why the likes of
> iswspace and iswupper only support the BMP, and what APIs to use to
> lift this limitation.  Likewise with display-related APIs, used to
> display Unicode text.

> If you don't teach UTF-16 including these details, the programmers
> will feel lost when they meet with these complications.

So what's new compared to UTF-8?  The problem would be a misconception
that MSVC's wchar_t supported Unicode - or has that been fixed
recently?  The neutral message is to avoid wchar_t where possible.

C++11 and C11's char32_t ought to have fixed the problem.

Functions iswspace() and iswlower() are not stable; one really has to
replace them by the project's UCD routines.  For example, when the
locale is a Unicode locale with the obvious wchar_t representations, the
value of iswlower(0x13A0) recently changed from non-zero to zero, as
U+13A0 changed from gc=Lo to gc=Lu.  I don't think iswupper() is any
stabler.
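
As a rough illustration of asking the UCD directly instead of the C library, the sketch below uses Python's standard unicodedata module. The answer depends on the Unicode version bundled with the interpreter, which is why the version is printed alongside; a project that needs stable answers would pin its own copy of the UCD. The helper name general_category is made up for this example.

```python
import unicodedata


def general_category(code_point):
    """Return the General_Category of a code point per the bundled UCD."""
    return unicodedata.category(chr(code_point))


# U+13A0 CHEROKEE LETTER A: the reported category depends on the UCD version.
print(unicodedata.unidata_version, general_category(0x13A0))
```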

Richard.


Re: Unicode education in Schools

2017-08-26 Thread Eli Zaretskii via Unicode
> Date: Sat, 26 Aug 2017 18:52:03 +0100
> From: Richard Wordingham via Unicode 
> 
> > > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > > units are bigger.  
> 
> > Not really, since UTF-8 doesn't have surrogates.
> 
> It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
> trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
> and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
> uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
> of the few systems that comes close to allowing them the dignity of
> integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
> 0x80 to 0xFF.
> 
> I well remember when Unicode regular expressions were required to
> allow one to search for lone surrogates, but there was no such concept
> of looking for isolated ill-associated bytes in Unicode 8-bit strings.
> 
> The point is that if one understands how UTF-8 works, UTF-16 is a
> system that works using a subset of the same principles, and one should
> therefore understand how UTF-16 works, until one comes to the weird and
> dubious concept of surrogate points having properties.  I believe the
> latter concept is of value only in code that lacks the concept of
> gibberish.  In UTF-8, the distinction between code unit value and
> Unicode scalar value is very clear; in UTF-16, it is muddied by the
> concept of 'codepoint'.

We are miscommunicating.  My point was that programming for MS-Windows
needs a good understanding of what the UTF-16 surrogates are, and in
what MS-Windows APIs/library functions they can and cannot be used.
Without this understanding, one cannot figure out why the likes of
iswspace and iswupper only support the BMP, and what APIs to use to
lift this limitation.  Likewise with display-related APIs, used to
display Unicode text.

If you don't teach UTF-16 including these details, the programmers
will feel lost when they meet with these complications.


Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Sat, 26 Aug 2017 18:55:25 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode 

> > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > units are bigger.  

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
of the few systems that comes close to allowing them the dignity of
integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
0x80 to 0xFF.
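
The byte counts above can be checked mechanically. The sketch below classifies every 8-bit value by the role it can play in well-formed UTF-8; the labels are descriptive names chosen for this example, and "surrogate" in the paragraph above is an analogy rather than standard UTF-8 terminology.

```python
from collections import Counter


def classify(byte):
    """Classify a byte value by the role it can play in well-formed UTF-8."""
    if byte < 0x80:
        return "ascii"
    if byte <= 0xBF:
        return "trailing"
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "never valid"
    return "leading"  # 0xC2 to 0xF4


counts = Counter(classify(b) for b in range(0x100))
print(counts["trailing"], counts["leading"], counts["never valid"])
```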

I well remember when Unicode regular expressions were required to
allow one to search for lone surrogates, but there was no such concept
of looking for isolated ill-associated bytes in Unicode 8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a
system that works using a subset of the same principles, and one should
therefore understand how UTF-16 works, until one comes to the weird and
dubious concept of surrogate points having properties.  I believe the
latter concept is of value only in code that lacks the concept of
gibberish.  In UTF-8, the distinction between code unit value and
Unicode scalar value is very clear; in UTF-16, it is muddied by the
concept of 'codepoint'.

Richard.



Re: Unicode education in Schools

2017-08-26 Thread Eli Zaretskii via Unicode
> Date: Sat, 26 Aug 2017 16:09:33 +0100
> From: Richard Wordingham via Unicode 
> 
> > > Just steer them away from UTF-16!  
> > 
> > Which will leave them entirely unprepared for the MS-Windows Unicode
> > programming, something they of course will never need in their
> > careers.
> 
> It shouldn't.  UTF-16 works just like UTF-8, except that the code units
> are bigger.

Not really, since UTF-8 doesn't have surrogates.


Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 09:36:00 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Fri, 25 Aug 2017 00:23:40 +0100
> > From: Richard Wordingham via Unicode 
> > 
> > On Thu, 24 Aug 2017 17:17:10 +
> > Andre Schappo via Unicode  wrote:
> >   
> > > So, I consider it important to familiarise students with SMP
> > > characters as well as BMP characters. Then when they develop
> > > software they will, at the start, be thinking beyond ASCII and
> > > Unicode BMP characters.  
> > 
> > Just steer them away from UTF-16!  
> 
> Which will leave them entirely unprepared for the MS-Windows Unicode
> programming, something they of course will never need in their
> careers.

It shouldn't.  UTF-16 works just like UTF-8, except that the code units
are bigger.  The problem is that accidentally ignoring the difference
between UTF-16 and UCS-2 takes longer to be detected, and therefore
correcting the error may be very difficult.  Ignoring the difference
between ASCII (or an 8-bit coding) and UTF-8 shows up very quickly, and
therefore is less difficult to fix, for less is broken by the obvious
correction.
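
A small sketch of why the two mistakes surface at different times: code that assumes one code unit per character is wrong in UTF-8 as soon as any non-ASCII character appears, but in UTF-16 it stays right until a character outside the BMP appears. The helper name code_units and the sample characters are chosen for this example only.

```python
def code_units(text, encoding):
    """Count the code units needed to encode text in UTF-8 or UTF-16."""
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    return len(text.encode(encoding)) // unit_size


# ASCII, Latin-1 range, other BMP, and a supplementary-plane character.
for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch, "utf-8"), code_units(ch, "utf-16-le"))
```

Only the last line of output shows more than one UTF-16 code unit, which is why a test suite with no supplementary-plane characters never exposes the UCS-2 assumption.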

Richard.



Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 12:57:37 +0100 (BST)
William_J_G Overington via Unicode  wrote:

> UTF-16 is very useful. I use it in my research project.

> If the byte content of a UTF-16 file is displayed in a hexadecimal
> display then for all plane 0 characters the byte content of the
> character codes are thereby displayed directly.

But only plane 0.
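
A quick sketch of what such a hexadecimal display shows, assuming big-endian UTF-16 without a byte order mark. The helper names are invented for the example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def dump(text):
    """Format the UTF-16 code units of text as four-digit hex groups."""
    return " ".join(f"{unit:04X}" for unit in utf16_units(text))


print(dump("A\u20AC"))      # plane 0: the groups are the code points themselves
print(dump("\U0001F600"))   # plane 1: a surrogate pair, not the code point
```

For U+0041 and U+20AC the dump reads 0041 20AC, while U+1F600 appears as D83D DE00.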

How tedious (and expensive) would it be to obtain a licence to convert,
and freely share, the UCD to UTF-8 or UTF-16?  The code charts might
have to be a separate issue because of the fonts.

> Also, all characters that can be encoded in Unicode can be stored in
> a UTF-16 file.

Or UTF-8.  UTF-32 support is a bit limited.

Richard.


Re: Unicode education in Schools

2017-08-26 Thread Norbert Lindenberg via Unicode
ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert


> On Aug 24, 2017, at 22:14 , Peter Constable via Unicode <unicode@unicode.org> 
> wrote:
> 
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it 
> managed to progress beyond that?
>
> Peter
>
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner 
> via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List <unicode@unicode.org>
> Subject: Fwd: Unicode education in Schools
>
> -- Forwarded message -
> From: David Starner <prosfil...@gmail.com>
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham <richard.wording...@ntlworld.com>
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode 
> <unicode@unicode.org> wrote:
> 
> Just steer them away from UTF-16!  (And vigorously prohibit the very
> concept of UCS-2).
> 
> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java 
> strings. If they're using GTK, use strings compatible with GTK. If they're 
> writing JavaScript, use JavaScript strings. There's basically no system 
> without Unicode strings or that they would be better off rewriting the wheel.
> 




Re: Unicode education in Schools

2017-08-25 Thread William_J_G Overington via Unicode
Richard Wordingham wrote:

> Just steer them away from UTF-16!  (And vigorously prohibit the very concept 
> of UCS-2).

UTF-16 is very useful. I use it in my research project.

If the byte content of a UTF-16 file is displayed in a hexadecimal display then 
for all plane 0 characters the byte content of the character codes are thereby 
displayed directly.

Also, all characters that can be encoded in Unicode can be stored in a UTF-16 
file.

William Overington

Friday 25 August 2017


 
Original message
From : unicode@unicode.org
Date : 2017/08/25 - 00:23 (GMTST)
To : unicode@unicode.org
Subject : Re: Unicode education in Schools

On Thu, 24 Aug 2017 17:17:10 +
Andre Schappo via Unicode <unicode@unicode.org> wrote:

> So, I consider it important to familiarise students with SMP
> characters as well as BMP characters. Then when they develop software
> they will, at the start, be thinking beyond ASCII and Unicode BMP
> characters.

Just steer them away from UTF-16!  (And vigorously prohibit the very
concept of UCS-2).

Richard.



Re: Unicode education in Schools

2017-08-25 Thread Mark Davis ☕️ via Unicode
Mark

(https://twitter.com/mark_e_davis)

On Thu, Aug 24, 2017 at 11:01 PM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:
>
>> Because there are many systems that can now handle BMP characters but
>> cannot handle SMP characters.
>>
>> One example being systems that use mysql utf8 (3 byte encoding) and have
>> not yet updated to utf8mb4 (4 byte encoding)
>>
>> So, I consider it important to familiarise students with SMP characters
>> as well as BMP characters. Then when they develop software they will, at
>> the start, be thinking beyond ASCII and Unicode BMP characters.
>>
>
> The thinking "beyond BMP" part only comes in when you work in encoding
> forms where the BMP uses a different number of code units than the SMP (or
> any other non-BMP "page"). This is true for both utf8 and utf16 but not if
> you work in utf32 or in scalar values (as in the posted exercise).
>
>
> The trick with using emoji in this lesson is that the descriptions and
> images are meaningful to any English speaker, so it gets the student to
> learn about character names.
>
> The same exercise would be more of a challenge for students whose native
> tongue is not English.


> The trick with using emoji...

True. For emoji names it would be better to use the CLDR names with
non-anglophone audiences, since those names are available in a number of
languages.

eg http://www.unicode.org/cldr/charts/31/annotations/romance.html# (that
was last release's version; next release will have improvements...)
​

>
>
> A./
>
>
>> André Schappo
>>
>> On 24 Aug 2017, at 17:45, Shriramana Sharma  wrote:
>>>
>>> So how do you think it matters if the characters are in the BMP or SMP?
>>>
>>
>>
>>
>


Re: Unicode education in Schools

2017-08-25 Thread Eli Zaretskii via Unicode
> Date: Fri, 25 Aug 2017 00:23:40 +0100
> From: Richard Wordingham via Unicode 
> 
> On Thu, 24 Aug 2017 17:17:10 +
> Andre Schappo via Unicode  wrote:
> 
> > So, I consider it important to familiarise students with SMP
> > characters as well as BMP characters. Then when they develop software
> > they will, at the start, be thinking beyond ASCII and Unicode BMP
> > characters.
> 
> Just steer them away from UTF-16!

Which will leave them entirely unprepared for the MS-Windows Unicode
programming, something they of course will never need in their
careers.


RE: Unicode education in Schools

2017-08-25 Thread via Unicode
Use String.codePointAt() etc.



El ago. 24, 2017 10:42 PM -0700, Shriramana Sharma via Unicode 
, escribió:
> IIUC the limitation seems to be only that functions such as "charAt" do not 
> recognize that surrogates aren't valid characters:
>
> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
>  via https://stackoverflow.com/a/8716157/1503120.
>
> This is a problem of many 32-bit char based toolkits too and doesn't (can't?) 
> have an efficient solution for SMP without counting the surrogates (and 
> checking them). Right?


RE: Unicode education in Schools

2017-08-24 Thread Shriramana Sharma via Unicode
IIUC the limitation seems to be only that functions such as "charAt" do not
recognize that surrogates aren't valid characters:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
via https://stackoverflow.com/a/8716157/1503120.

This is a problem of many 32-bit char based toolkits too and doesn't
(can't?) have an efficient solution for SMP without counting the surrogates
(and checking them). Right?
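
To illustrate the cost being asked about: finding the number of code points (or the n-th code point) in a sequence of 16-bit code units takes a linear scan that pairs up surrogates. The sketch below models a UTF-16 string as a plain list of integers; the function name and the choice to count an unpaired surrogate as one item are assumptions of this example.

```python
def count_code_points(units):
    """Count code points in a list of UTF-16 code units.

    A high surrogate followed by a low surrogate counts as one code point;
    an unpaired surrogate is counted as one item on its own.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


# "A", U+1F600 as a surrogate pair, "B": four code units, three code points.
print(count_code_points([0x0041, 0xD83D, 0xDE00, 0x0042]))
```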


RE: Unicode education in Schools

2017-08-24 Thread Peter Constable via Unicode
I thought Javascript had a UCS-2 understanding of Unicode strings. Has it 
managed to progress beyond that?


Peter


From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner 
via Unicode
Sent: Thursday, August 24, 2017 5:18 PM
To: Unicode Mailing List <unicode@unicode.org>
Subject: Fwd: Unicode education in Schools


-- Forwarded message -
From: David Starner <prosfil...@gmail.com>
Date: Thu, Aug 24, 2017, 6:16 PM
Subject: Re: Unicode education in Schools
To: Richard Wordingham <richard.wording...@ntlworld.com>


On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode 
<unicode@unicode.org> wrote:
Just steer them away from UTF-16!  (And vigorously prohibit the very
concept of UCS-2).

Richard.

Steer them away from reinventing the wheel. If they use Java, use Java strings. 
If they're using GTK, use strings compatible with GTK. If they're writing 
JavaScript, use JavaScript strings. There's basically no system without Unicode 
strings or that they would be better off rewriting the wheel.


Re: Unicode education in Schools

2017-08-24 Thread Philippe Verdy via Unicode
Strings in Java and JavaScript are basically the same, as they are arbitrary
sequences of 16-bit code units, not restricted to text with valid
UTF-16 encoding. The differences are in the set of access methods, but they
are both normally immutable, and both allow (but do not enforce) substrings to
share their backing store between distinct instances. The same applies to
C/C++ "wide strings" when their code units are larger than 1 byte, but
C/C++ do not make them immutable, except using dedicated classes, which
will transiently allow setting their content through constructors, and
C/C++ wide strings exist with several signed and unsigned code unit types
(whereas Java has only unsigned 16-bit code units in its "char", and
JavaScript has no "char" type but only "Number" types, with valid range
restrictions applied when constructing String instances from code units or
from code point values).

Javascript should soon have a new numeric type (it is provisionally named
"BigInt", a signed 64-bit integer, and will have constants suffixed by "n",
and there will be no implicit promotion from/to Number but only explicit
conversions by checked constructors) and new code unit types for mutable
buffers (but only for the range checks of their write accessors, using
"Number" 64-bit floating points or the newer "BigInt" 64-bit integers).

There are similar designs in Perl, PHP, and most languages: Unicode support
and conformance for using these types for valid text is implemented only by
libraries in their standard text API or in their I/O APIs taking immutable
strings or mutable buffers in parameters, or returning sharable but
immutable string instances or a mutable buffer referenced on input or
allocated internally, but these API's are not restricted to just valid
Unicode text handling and allow using their strings with any other encoding.

With immutable strings implemented as classes, the backing store is
normally not directly accessible even by reference, you can just reference
the class referencing internally the backing store... implemented using
mutable buffers and using an internal encoding which may be different from
the one exposed by the string class (possibly using compression techniques
for their backing store, on demand, and implicit atomization of most
frequently used string values, notably the empty string and string values
representing a single character with an 8-bit only code point value, or
strings containing any repetition of the same code point value:  these
values do not need any internally allocated buffer for their backing store,
so these instances are allocated very fast, and do not stress the garbage
collector when they are no longer used).

When Unicode text handling methods are supported by their exposed methods,
the Unicode validation rules are not necessarily checked everywhere, so it
is still possible to have strings or buffers containing a single unpaired
surrogate value. The backing store may also allow storing code units
outside the ranges used by valid UTF-16 or valid UTF-32 (the backing stores
are virtualized and could be on disk and swapped on demand with reusable
buffers from a pool).
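
As a small illustration of a string type that tolerates unpaired surrogates until some operation validates, here is a sketch in Python, whose str can hold a lone surrogate code point but whose strict UTF-8 encoder rejects it. The helper names are invented for this example.

```python
def lone_surrogate_offsets(text):
    """Return the offsets of surrogate code points held in a Python str."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def encodes_as_utf8(text):
    """Report whether text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


broken = "a\ud800b"  # constructing the string succeeds; no validation happens here
print(lone_surrogate_offsets(broken), encodes_as_utf8(broken))
```

The string is created without complaint, and the problem is only reported at the point where encoding is attempted.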

2017-08-25 2:17 GMT+02:00 David Starner via Unicode <unicode@unicode.org>:

>
>
> -- Forwarded message -----
> From: David Starner <prosfil...@gmail.com>
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham <richard.wording...@ntlworld.com>
>
>
>
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
>
>> Just steer them away from UTF-16!  (And vigorously prohibit the very
>> concept of UCS-2).
>>
>> Richard.
>>
>
> Steer them away from reinventing the wheel. If they use Java, use Java
> strings. If they're using GTK, use strings compatible with GTK. If they're
> writing JavaScript, use JavaScript strings. There's basically no system
> without Unicode strings or that they would be better off rewriting the
> wheel.
>
>>


Re: Unicode education in Schools

2017-08-24 Thread Richard Wordingham via Unicode
On Thu, 24 Aug 2017 17:17:10 +
Andre Schappo via Unicode  wrote:

> So, I consider it important to familiarise students with SMP
> characters as well as BMP characters. Then when they develop software
> they will, at the start, be thinking beyond ASCII and Unicode BMP
> characters.

Just steer them away from UTF-16!  (And vigorously prohibit the very
concept of UCS-2).

Richard.


Re: Unicode education in Schools

2017-08-24 Thread Asmus Freytag via Unicode

On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:

Because there are many systems that can now handle BMP characters but 
cannot handle SMP characters.

One example being systems that use mysql utf8 (3 byte encoding) and have not 
yet updated to utf8mb4 (4 byte encoding)

So, I consider it important to familiarise students with SMP characters as well 
as BMP characters. Then when they develop software they will, at the start, be 
thinking beyond ASCII and Unicode BMP characters.


The thinking "beyond BMP" part only comes in when you work in encoding 
forms where the BMP uses a different number of code units than the SMP 
(or any other non-BMP "page"). This is true for both utf8 and utf16 but 
not if you work in utf32 or in scalar values (as in the posted exercise).



The trick with using emoji in this lesson is that the descriptions and 
images are meaningful to any English speaker, so it gets the student to 
learn about character names.


The same exercise would be more of a challenge for students whose native 
tongue is not English.


A./


André Schappo


On 24 Aug 2017, at 17:45, Shriramana Sharma  wrote:

So how do you think it matters if the characters are in the BMP or SMP?







Re: Unicode education in Schools

2017-08-24 Thread Philippe Verdy via Unicode
2017-08-24 19:17 GMT+02:00 Andre Schappo via Unicode :

>
> Because there are many systems that can now handle BMP characters but
> cannot handle SMP characters.
>
> One example being systems that use mysql utf8 (3 byte encoding) and have
> not yet updated to utf8mb4 (4 byte encoding)
>

Mysql's utf8 is known to cause severe problems, notably on wikis installed
by default with it: the presence of any non-BMP character (SMP or emojis
are now very frequent and available on almost all modern smartphones) in
the edited text will cause its **silent** truncation when uploading it to
the server (when it will save the text to the database) even if any unsaved
preview was correct. You will see the truncation when the page is loaded
again.

Mysql's "utf8" should have been dropped long ago and replaced by utf8mb4,
or set up so that data sent to a "utf8"-encoded database would cause an SQL
error that cannot be silently ignored with truncation (or at least it
should only cause the non-BMP characters to be filtered out, without
silently deleting everything that follows).

This is an old severe bug of Mysql (on the server itself) or in the
connection protocol, or internal filters used by Mysql client library, that
has caused many severe security issues (such as discarding logs or todo
lists, or loss of pending commercial transactions such as lists of payments
to process to a bank or truncated billings sent to customers, or loss of
contact address or name, or broken complete addresses for product delivery
to a customer, or missing items in a delivered box and lost products in the
middle of their routing).

This is a demonstration that not signaling encoding errors to an
application, or not clearly specifying that an API may cause encoding
exceptions that must be caught and must not be ignored in applications, can
hurt. Even if you use "utf8mb4", encoding errors are still possible and must
not be ignored, as the final result will be unpredictable.
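
One way an application could make the failure loud is to check text before it reaches a store limited to 3-byte UTF-8. The sketch below is a hypothetical guard; the exception class and function name are invented for this example, and it calls no MySQL API.

```python
class SupplementaryCharacterError(ValueError):
    """Raised when text contains a character that needs 4 bytes in UTF-8."""


def require_bmp_only(text):
    """Return text unchanged, or raise if any code point lies above U+FFFF."""
    for offset, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise SupplementaryCharacterError(
                f"U+{ord(ch):04X} at offset {offset} needs 4 bytes in UTF-8"
            )
    return text


try:
    require_bmp_only("order \U0001F600 confirmed")
except SupplementaryCharacterError as err:
    print("rejected:", err)
```

Raising at this point leaves the caller to decide whether to reject the input, filter the characters out, or migrate the storage, rather than losing everything after the first supplementary character.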


Re: Unicode education in Schools

2017-08-24 Thread Andre Schappo via Unicode

Because there are many systems that can now handle BMP characters but 
cannot handle SMP characters.

One example being systems that use mysql utf8 (3 byte encoding) and have not 
yet updated to utf8mb4 (4 byte encoding)

So, I consider it important to familiarise students with SMP characters as well 
as BMP characters. Then when they develop software they will, at the start, be 
thinking beyond ASCII and Unicode BMP characters.

André Schappo

> On 24 Aug 2017, at 17:45, Shriramana Sharma  wrote:
> 
> So how do you think it matters if the characters are in the BMP or SMP?




Re: Unicode education in Schools

2017-08-24 Thread Shriramana Sharma via Unicode
So how do you think it matters if the characters are in the BMP or SMP?