Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Asmus Freytag via Unicode

  
  
On 7/17/2019 6:03 PM, Richard Wordingham via Unicode wrote:

> On Thu, 18 Jul 2019 01:54:52 +0200
> Philippe Verdy via Unicode wrote:
>
> > In fact the ligatures system for the "cursive" Egyptian Hieratic is so
> > complex (and may also have its own variants showing its progression
> > from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic
> > should no longer be considered "unified" with Hieroglyphs, and its
> > existing ISO 15924 code is then not represented at all in Unicode.
>
> Writing hieroglyphic text as plain text has only been supported since
> Unicode 12.0, so it may take a little while to explore workable encoding
> conventions.
>
> A significant issue is that the hieratic script is right to left but
> Unicode only standardises the encoding of left-to-right
> transcriptions.  I don't recall the difference between retrograde v.
> normal text being declared a style difference.

Use directional overrides. Those have been in the standard forever.

A./

> For comparison, we still have no guidance on how to encode sexagesimal
> Mesopotamian cuneiform numbers, e.g. '610' v. '20' written using the U
> graphic element.
>
> Richard.
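
As a rough illustration of the directional overrides suggested in this message, the sketch below wraps a run of hieroglyphs in U+202E RIGHT-TO-LEFT OVERRIDE and U+202C POP DIRECTIONAL FORMATTING. The helper name force_rtl and the sample code points are assumptions chosen for illustration. Whether a renderer also mirrors the individual glyphs depends on the font and shaping engine, which this sketch does not address.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def force_rtl(text: str) -> str:
    """Wrap text so that a bidi-aware renderer lays it out right to left.

    Hypothetical helper: it only adds the two format characters and
    does not reorder or mirror anything itself.
    """
    return f"{RLO}{text}{PDF}"


# Three arbitrary code points from the Egyptian Hieroglyphs block,
# used here purely as sample data.
sample = "\U00013000\U00013001\U00013002"
wrapped = force_rtl(sample)

# List the code points in logical (storage) order.
for ch in wrapped:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The stored order of the hieroglyphs is unchanged; only the display direction is requested, so searching and comparison still operate on the logical sequence.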





  



Unicode's got a new logo?

2019-07-17 Thread Yifán Wáng via Unicode
Hi there,

I cannot help but notice that the new home.unicode.org site uses a new
logo, with a blue base color and a humanist typeface, rather than the
traditional red, geometric one. Does anybody know whether this means that
Unicode is renewing its logo, or whether the two serve different
purposes? Which should I cite as the official logo? I think I've read
the description and the blog post but couldn't find an explanation.

Thank you.


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Richard Wordingham via Unicode
On Thu, 18 Jul 2019 01:54:52 +0200
Philippe Verdy via Unicode  wrote:

> In fact the ligatures system for the "cursive" Egyptian Hieratic is so
> complex (and may also have its own variants showing its progression
> from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic
> should no longer be considered "unified" with Hieroglyphs, and its
> existing ISO 15924 code is then not represented at all in Unicode.

Writing hieroglyphic text as plain text has only been supported since
Unicode 12.0, so it may take a little while to explore workable encoding
conventions.

A significant issue is that the hieratic script is right to left but
Unicode only standardises the encoding of left-to-right
transcriptions.  I don't recall the difference between retrograde v.
normal text being declared a style difference.

For comparison, we still have no guidance on how to encode sexagesimal
Mesopotamian cuneiform numbers, e.g. '610' v. '20' written using the U
graphic element.

Richard.


Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:25 AM, Sławomir Osipiuk wrote:

> “Transliteration”?
>
> Maybe more generic than what you’re looking for. Used for the process
> of producing the “machine readable zone” on passports:
>
> https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see
> section 6, page 30)
>
> “Accent folding” or “diacritic folding” is used in some places. String
> folding is “A string transform F, with the property that repeated
> applications of the same function F produce the same output: F(F(S)) =
> F(S) for all input strings S”. Accent folding is a special case of that.
>
> https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions
>
> https://alistapart.com/article/accent-folding-for-auto-complete/

Diacritic folding. Thanks. Just didn't think of the operation as folding
the way it came up, but that's what it is.

A./


From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus
Freytag via Unicode
Sent: Wednesday, July 17, 2019 13:38
To: Unicode Mailing List
Subject: Removing accents and diacritics from a word

A question has come up in another context:

Is there any linguistic term for describing the process of removing 
accents and diacritics from a word to create its “base form”, e.g. São 
Tomé to Sao Tome?


The linguistic term "string normalization" appears not that preferable 
in a computing context.


Any ideas?

A./







Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
But my concern also applies to Egyptian Hieratic. Chapter 14 considers it
"unified" with the Hieroglyphs as a cursive variant, yet no font currently
supports it, because of the very complex set of ligatures this would
require, and it may not even work properly with the existing markup
notations used with Hieroglyphs.
If the "Manuel de codage" for Egyptian Hieroglyphs (which describes a markup
notation) contains extensions to represent the Hieratic variants with the
unified Hieroglyphs, then the Unicode version (Age) used for Hieroglyphs
should also be assigned to Hieratic.

In fact the ligatures system for the "cursive" Egyptian Hieratic is so
complex (and may also have its own variants showing its progression from
Hieroglyphs to Demotic or Old Coptic), that probably Hieratic should no
longer be considered "unified" with Hieroglyphs, and its existing ISO 15924
code is then not represented at all in Unicode.

For now ISO 15924 still does not consider Egyptian Hieratic to be "unified"
with Egyptian Hieroglyphs: its descriptive names in English and French carry
no suffix like "(cursive variant of Egyptian Hieroglyphs)", and no "Unicode
Age" version is given for it, as if it were still not encoded at all by
Unicode. Chapter 14 of the standard (in its section about Hieroglyphs, where
Hieratic is cited once) is therefore probably misleading, pending further
study.

I am also unable to find any non-proprietary (interoperable?) attempt to
encode Hieratic; the only attempts cover Hieroglyphs.

On Thu, 18 Jul 2019 at 01:16, Philippe Verdy wrote:

> Sorry I misread (with an automated tool) an old dataset where these "3.0"
> versions were indicated in an incorrect form
>
> On Thu, 18 Jul 2019 at 01:07, Philippe Verdy wrote:
>
>> Note also that there are variants registered with Unicode versions (Age)
>> for symbols, even if they don't have any assigned Unicode alias, but this
>> is not a problem.
>> 994 Zinh Code for inherited script codet pour écriture héritée Inherited 2009-02-23
>> 995 *Zmth* Mathematical notation notation mathématique 3.2 2007-11-26
>> 993 *Zsye* Symbols (Emoji variant) symboles (variante émoji) 6.0 2015-12-16
>> 996 *Zsym* Symbols symboles 1.1 2007-11-26
>> The Unicode version is important information: it allows determining
>> that texts created in a given language (or notation), and written in these
>> scripts, can be written using the UCS.
>>
>> Weren't the 3 variants of Syriac unified in Unicode (even if they may be
>> distinguished in ISO 15924, for example to allow selecting a suitable,
>> preferred set of fonts, as is commonly done for Chinese Mandarin,
>> Arabic, Japanese, Korean or Latin)?
>>
>>
>> On Thu, 18 Jul 2019 at 00:55, Philippe Verdy wrote:
>>
>>> The ISO 15924/RA reference page contains indication of support in
>>> Unicode for variants of various scripts such as Aran, Latf, Latg, Hanb,
>>> Hans, Hant:.
>>> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
>>> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
>>> 2014-11-15
>>> ...
>>> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec
>>> bopomofo (alias pour han + bopomofo) 1.1 2016-01-19
>>>
>>> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han
>>> 1.1 2009-02-23
>>>
>>> 501 *Hans* Han (Simplified variant) idéogrammes han (variante
>>> simplifiée) 1.1 2004-05-29
>>> 502 *Hant* Han (Traditional variant) idéogrammes han (variante
>>> traditionnelle) 1.1 2004-05-29
>>> ...
>>> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1
>>> 2004-05-01
>>> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1
>>> 2004-05-01
>>> 215 *Latn* Latin latin Latin 1.1 2004-05-01
>>> ...
>>> There are other entries for aliases or mixed script also for Japanese
>>> and Korean.
>>>
>>> But for Syriac variants this is missing and this is the only script for
>>> which this occurs:
>>> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
>>> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
>>> 2004-05-01
>>> 137 Syrj Syriac (Western variant) syriaque (variante occidentale)
>>> 2004-05-01
>>> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale)
>>> 2004-05-01
>>> Why is there no Unicode version given for these 3 variants ?
>>>
>>>


Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:37 AM, Tex wrote:

> Asmus, are you including the case where an accented character maps to
> two unaccented characters?
>
> e.g. Å to AA or Ä to AE

If that's covered by the same term; but it's not simple
"typewriter/telegraph" fallback.





From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus
Freytag (c) via Unicode
Sent: Wednesday, July 17, 2019 11:07 AM
To: Norbert Lindenberg
Cc: Unicode Mailing List
Subject: Re: Removing accents and diacritics from a word

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

“Misspelling”?

Not helpful. Anybody have a serious suggestion?

A./

On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode wrote:

A question has come up in another context:

Is there any linguistic term for describing the process of removing 
accents and diacritics from a word to create its “base form”, e.g. São Tomé to 
Sao Tome?

The linguistic term "string normalization" appears not that preferable 
in a computing context.

Any ideas?

A./





Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
Sorry I misread (with an automated tool) an old dataset where these "3.0"
versions were indicated in an incorrect form

On Thu, 18 Jul 2019 at 01:07, Philippe Verdy wrote:

> Note also that there are variants registered with Unicode versions (Age)
> for symbols, even if they don't have any assigned Unicode alias, but this
> is not a problem.
> 994 Zinh Code for inherited script codet pour écriture héritée Inherited 2009-02-23
> 995 *Zmth* Mathematical notation notation mathématique 3.2 2007-11-26
> 993 *Zsye* Symbols (Emoji variant) symboles (variante émoji) 6.0 2015-12-16
> 996 *Zsym* Symbols symboles 1.1 2007-11-26
> The Unicode version is important information: it allows determining
> that texts created in a given language (or notation), and written in these
> scripts, can be written using the UCS.
>
> Weren't the 3 variants of Syriac unified in Unicode (even if they may be
> distinguished in ISO 15924, for example to allow selecting a suitable,
> preferred set of fonts, as is commonly done for Chinese Mandarin,
> Arabic, Japanese, Korean or Latin)?
>
>
> On Thu, 18 Jul 2019 at 00:55, Philippe Verdy wrote:
>
>> The ISO 15924/RA reference page contains indication of support in Unicode
>> for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:.
>> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
>> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
>> 2014-11-15
>> ...
>> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec
>> bopomofo (alias pour han + bopomofo) 1.1 2016-01-19
>>
>> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han
>> 1.1 2009-02-23
>>
>> 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée)
>> 1.1 2004-05-29
>> 502 *Hant* Han (Traditional variant) idéogrammes han (variante
>> traditionnelle) 1.1 2004-05-29
>> ...
>> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01
>> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1
>> 2004-05-01
>> 215 *Latn* Latin latin Latin 1.1 2004-05-01
>> ...
>> There are other entries for aliases or mixed script also for Japanese and
>> Korean.
>>
>> But for Syriac variants this is missing and this is the only script for
>> which this occurs:
>> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
>> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
>> 2004-05-01
>> 137 Syrj Syriac (Western variant) syriaque (variante occidentale)
>> 2004-05-01
>> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale)
>> 2004-05-01
>> Why is there no Unicode version given for these 3 variants ?
>>
>>


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
Note also that there are variants registered with Unicode versions (Age)
for symbols, even if they don't have any assigned Unicode alias, but this
is not a problem.
994 Zinh Code for inherited script codet pour écriture héritée Inherited 2009-02-23
995 *Zmth* Mathematical notation notation mathématique 3.2 2007-11-26
993 *Zsye* Symbols (Emoji variant) symboles (variante émoji) 6.0 2015-12-16
996 *Zsym* Symbols symboles 1.1 2007-11-26
The Unicode version is important information: it allows determining
that texts created in a given language (or notation), and written in these
scripts, can be written using the UCS.

Weren't the 3 variants of Syriac unified in Unicode (even if they may be
distinguished in ISO 15924, for example to allow selecting a suitable,
preferred set of fonts, as is commonly done for Chinese Mandarin,
Arabic, Japanese, Korean or Latin)?


On Thu, 18 Jul 2019 at 00:55, Philippe Verdy wrote:

> The ISO 15924/RA reference page contains indication of support in Unicode
> for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:.
> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
> 2014-11-15
> ...
> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec bopomofo
> (alias pour han + bopomofo) 1.1 2016-01-19
>
> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han 1.1
> 2009-02-23
>
> 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée)
> 1.1 2004-05-29
> 502 *Hant* Han (Traditional variant) idéogrammes han (variante
> traditionnelle) 1.1 2004-05-29
> ...
> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01
> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 2004-05-01
> 215 *Latn* Latin latin Latin 1.1 2004-05-01
> ...
> There are other entries for aliases or mixed script also for Japanese and
> Korean.
>
> But for Syriac variants this is missing and this is the only script for
> which this occurs:
> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
> 2004-05-01
> 137 Syrj Syriac (Western variant) syriaque (variante occidentale)
> 2004-05-01
> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale) 2004-05-01
> Why is there no Unicode version given for these 3 variants ?
>
>


ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
The ISO 15924/RA reference page contains indication of support in Unicode
for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:
160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1 2014-11-15
...
503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec bopomofo (alias pour han + bopomofo) 1.1 2016-01-19
500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han 1.1 2009-02-23
501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée) 1.1 2004-05-29
502 *Hant* Han (Traditional variant) idéogrammes han (variante traditionnelle) 1.1 2004-05-29
...
217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01
216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 2004-05-01
215 *Latn* Latin latin Latin 1.1 2004-05-01
...
There are other entries for aliases or mixed script also for Japanese and
Korean.

But for Syriac variants this is missing and this is the only script for
which this occurs:
135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo) 2004-05-01
137 Syrj Syriac (Western variant) syriaque (variante occidentale) 2004-05-01
136 Syrn Syriac (Eastern variant) syriaque (variante orientale) 2004-05-01
Why is there no Unicode version given for these 3 variants?
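
The check behind this question can be sketched in a few lines. The rows below are transcribed by hand from the listing in this message; the ScriptEntry layout and the function name missing_version are hypothetical, and nothing here reads the actual registry file.

```python
from typing import List, NamedTuple, Optional


class ScriptEntry(NamedTuple):
    number: int
    code: str
    english_name: str
    unicode_version: Optional[str]  # None when the row lists no version


# Hand-transcribed subset of the rows quoted in the message.
ENTRIES = [
    ScriptEntry(160, "Arab", "Arabic", "1.1"),
    ScriptEntry(161, "Aran", "Arabic (Nastaliq variant)", "1.1"),
    ScriptEntry(215, "Latn", "Latin", "1.1"),
    ScriptEntry(217, "Latf", "Latin (Fraktur variant)", "1.1"),
    ScriptEntry(135, "Syrc", "Syriac", "3.0"),
    ScriptEntry(138, "Syre", "Syriac (Estrangelo variant)", None),
    ScriptEntry(137, "Syrj", "Syriac (Western variant)", None),
    ScriptEntry(136, "Syrn", "Syriac (Eastern variant)", None),
]


def missing_version(entries: List[ScriptEntry]) -> List[str]:
    """Return the codes of entries that carry no Unicode version."""
    return [e.code for e in entries if e.unicode_version is None]


print(missing_version(ENTRIES))  # ['Syre', 'Syrj', 'Syrn']
```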


RE: Removing accents and diacritics from a word

2019-07-17 Thread Tex via Unicode
Asmus, are you including the case where an accented character maps to two 
unaccented characters?

 

e.g. Å to AA or Ä to AE
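
A minimal sketch of this one-to-two mapping, assuming a small hand-written table: the name FALLBACK and its contents are illustrative only, taken from the two examples above, and real conventions differ by language.

```python
import unicodedata

# Illustrative table only: the two letters from the example, in both cases.
FALLBACK = {
    "Å": "AA", "å": "aa",
    "Ä": "AE", "ä": "ae",
}


def expand_fallback(s: str) -> str:
    """Replace each listed letter with its multi-letter fallback.

    The input is first composed (NFC) so that a base letter followed by
    a combining mark is matched as the single precomposed letter.
    """
    composed = unicodedata.normalize("NFC", s)
    return "".join(FALLBACK.get(ch, ch) for ch in composed)


print(expand_fallback("Århus"))  # AArhus
```

Unlike simple mark stripping, this changes the length of the string, which is one reason it may deserve a different name.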

 

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
(c) via Unicode
Sent: Wednesday, July 17, 2019 11:07 AM
To: Norbert Lindenberg
Cc: Unicode Mailing List
Subject: Re: Removing accents and diacritics from a word

 

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

“Misspelling”?

Not helpful. Anybody have a serious suggestion?

A./

 
 
 

On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode wrote:
 
A question has come up in another context:
 
Is there any linguistic term for describing the process of removing accents and 
diacritics from a word to create its “base form”, e.g. São Tomé to Sao Tome?
 
The linguistic term "string normalization" appears not that preferable in a 
computing context.
 
Any ideas?
 
A./
 
 

 
 

 



RE: Removing accents and diacritics from a word

2019-07-17 Thread Sławomir Osipiuk via Unicode
“Transliteration”?

Maybe more generic than what you’re looking for. Used for the process of 
producing the “machine readable zone” on passports:

https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see section 6, page 30)

“Accent folding” or “diacritic folding” is used in some places. String folding 
is “A string transform F, with the property that repeated applications of the 
same function F produce the same output: F(F(S)) = F(S) for all input strings 
S”. Accent folding is a special case of that.

https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions

https://alistapart.com/article/accent-folding-for-auto-complete/
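
One common way to approximate diacritic folding is sketched below, under the assumption that decomposing to NFD and dropping combining marks is good enough. The function name fold_diacritics is hypothetical, and this is not the procedure defined in either linked document. Letters with no canonical decomposition, such as ø or ł, pass through unchanged.

```python
import unicodedata


def fold_diacritics(s: str) -> str:
    """Strip combining marks after canonical decomposition.

    unicodedata.combining() returns the canonical combining class,
    which is non-zero for the accents attached to Latin letters.
    """
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


folded = fold_diacritics("São Tomé")
print(folded)  # Sao Tome

# The folding property quoted above: applying F twice equals applying it once.
print(fold_diacritics(folded) == folded)  # True
```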

 

 

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
via Unicode
Sent: Wednesday, July 17, 2019 13:38
To: Unicode Mailing List
Subject: Removing accents and diacritics from a word

 

A question has come up in another context:

Is there any linguistic term for describing the process of removing accents and 
diacritics from a word to create its “base form”, e.g. São Tomé to Sao Tome?

The linguistic term "string normalization" appears not that preferable in a 
computing context.

Any ideas?

A./









Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

> “Misspelling”?

Not helpful. Anybody have a serious suggestion?

A./





On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode wrote:

A question has come up in another context:

Is there any linguistic term for describing the process of removing accents and 
diacritics from a word to create its “base form”, e.g. São Tomé to Sao Tome?

The linguistic term "string normalization" appears not that preferable in a 
computing context.

Any ideas?

A./








Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag via Unicode

  
  
A question has come up in another context:

Is there any linguistic term for describing the process of removing accents and
diacritics from a word to create its “base form”, e.g. São Tomé to Sao Tome?

The linguistic term "string normalization" appears not that preferable in a
computing context.

Any ideas?

A./