Re: Combining Marks and Variation Selectors

2020-02-02 Thread Eric Muller via Unicode

  
  
That would imply some coordination
  among variations sequences on different code points, right?
  
  E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on
  0B56 (Mn, ccc=0) would imply the existence of a variation sequence
  on 0B48 with the same variation selector, and the same effect.
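
  The equivalence in this example can be checked against the character database. The following is only a sketch, assuming Python's standard unicodedata module (its data version depends on the Python build); VS1 is picked purely for illustration, no such variation sequence is defined.

```python
import unicodedata

AI_SIGN = "\u0b48"    # ORIYA VOWEL SIGN AI
E_SIGN = "\u0b47"     # ORIYA VOWEL SIGN E
AI_LENGTH = "\u0b56"  # ORIYA AI LENGTH MARK
VS1 = "\ufe00"        # VARIATION SELECTOR-1, illustration only

# <0B48> decomposes canonically to <0B47, 0B56>.
decomposed = unicodedata.normalize("NFD", AI_SIGN)

# 0B56 is a nonspacing mark (Mn) that is nevertheless a starter (ccc=0).
length_mark_props = (unicodedata.category(AI_LENGTH),
                     unicodedata.combining(AI_LENGTH))

# Under NFC, a hypothetical sequence <0B47, 0B56, VS1> becomes <0B48, VS1>,
# so the selector ends up following 0B48 instead of 0B56.
recomposed = unicodedata.normalize("NFC", E_SIGN + AI_LENGTH + VS1)
```

  This is why a sequence on 0B56 would drag along one on 0B48: the two spellings are canonically equivalent, and normalization freely converts one into the other.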
  
  Eric.
  
  On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:


  
  
I don't think there is a technical reason for
  disallowing variation selectors after any starters (ccc=000);
  the normalization algorithm doesn't care about the general
  category of characters.

Mark

On Sun, Feb 2, 2020 at 10:09 AM Richard Wordingham via Unicode wrote:

On Sun, 2 Feb 2020 07:51:56 -0800
  Ken Whistler via Unicode wrote:
  
  > What it comes down to is avoidance of conundrums involving canonical
  > reordering for normalization. The effect of variation selectors is
  > defined in terms of an immediate adjacency. If you allowed variation
  > selectors to be defined for combining marks of ccc!=0, then
  > normalization of sequences could, in principle, move the two apart.
  > That would make implementation of the intended rendering much more
  > difficult.
  
  I can understand that for non-starters.  However, a lot of non-spacing
  combining marks are starters (i.e. ccc=0), so they would not be a
  problem.  <starter, variation selector> is an unbreakable block in
  canonical equivalence-preserving changes.  Is this restriction therefore
  just a holdover from when canonical equivalence could be corrected?
  
  Richard.
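
  The reordering concern quoted from Ken can be illustrated with a small sketch. It assumes Python's standard unicodedata module, and the sequence itself is hypothetical: no variation sequence is defined on either of these marks.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, ccc=230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc=220
VS1 = "\ufe00"        # VARIATION SELECTOR-1, ccc=0

# Hypothetical input: the selector is meant to apply to DOT_BELOW.
text = "a" + ACUTE + DOT_BELOW + VS1

# Canonical reordering sorts the two non-starters by combining class,
# while the selector (a starter) stays where it is.
reordered = unicodedata.normalize("NFD", text)
classes = [unicodedata.combining(ch) for ch in reordered]
```

  After normalization the selector follows the acute accent, not the dot below. A mark with ccc=0 is never moved this way, which is the case Richard is asking about.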

  


  



Re: Keyboard layouts and CLDR

2018-01-30 Thread Eric Muller via Unicode

Indeed.

But "Faÿ-lès-Nemours" / "FAŸ-LÈS-NEMOURS". "lès" in French place names 
means "near", typically followed by another city name or a river name.


In the case of "L'Haÿ-les-Roses", it's just that they have a famous rose 
garden, so "les".


Eric.

On 1/30/2018 12:06 AM, Martin J. Dürst via Unicode wrote:

On 2018/01/30 16:18, Philippe Verdy via Unicode wrote:

  - Adding Y to the list of allowed letters after the dieresis 
deadkey to
produce "Ÿ" : the most frequent case is L'HAŸE-LÈS-ROSES, the 
official name
of a French municipality when written with full capitalisation, 
almost all
spell checkers often forget to correct capitalized names such as this 
one.


Wikipedia has this as L'Haÿ-les-Roses (see 
https://fr.wikipedia.org/wiki/L'Haÿ-les-Roses). It surely would be 
L'HAŸ-LES-ROSES, and not L'HAŸE-LÈS-ROSES, when capitalized. I of 
course know of the phenomenon that in French, sometimes the accents on 
upper-case letters are left out, but I haven't heard of a reverse 
phenomenon yet.


Regards,   Martin.





0027, 02BC, 2019, or a new character?

2018-01-15 Thread Eric Muller via Unicode

https://www.nytimes.com/2018/01/15/world/asia/kazakhstan-alphabet-nursultan-nazarbayev.html

Eric.



Re: Database missing/erroneous information

2017-07-12 Thread Eric Muller via Unicode

  
  
In the .grouped.xml file, if a
  <char> does not have an attribute, it inherits it from its
  containing <group> element. The group containing the digits
  has IDC="Y" OIDC="N" XIDC="Y", and so that applies to the digits
  as well.
  
  If you don't want to deal with the inheritance mechanism, just use
  the .flat.xml files, where the <char> elements carry all the
  attributes.
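
  For those who prefer to stay with the grouped file, the inheritance can be resolved in a few lines. This is only a sketch: it assumes the UAX #42 namespace and the char/group element names, and the SAMPLE document is a made-up fragment, not an excerpt of the real file.

```python
import xml.etree.ElementTree as ET

UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Made-up fragment in the shape of ucd.all.grouped.xml (illustration only).
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <group gc="Nd" IDS="N" IDC="Y" OIDC="N" XIDC="Y">
      <char cp="0034" na="DIGIT FOUR" nv="4"/>
      <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" IDS="Y"/>
    </group>
  </repertoire>
</ucd>"""


def effective_attributes(root):
    """Map each code point to its attributes after group inheritance."""
    result = {}
    for group in root.iter(UCD_NS + "group"):
        for char in group.findall(UCD_NS + "char"):
            attrs = dict(group.attrib)   # start from the group's values
            attrs.update(char.attrib)    # the char's own values win
            if "cp" in attrs:            # ranges (first-cp/last-cp) are skipped here
                result[attrs["cp"]] = attrs
    return result


resolved = effective_attributes(ET.fromstring(SAMPLE))
```

  With this, resolved["0034"]["IDC"] is "Y" even though the char element does not spell it out.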
  
  Eric.
  
  
  On 7/12/2017 6:35 AM, J Decker via Unicode wrote:


  I started looking more deeply at the JavaScript
specification.  Identifiers are defined as starting with
characters with ID_Start and continued with ID_Continue
attributes.
I grabbed the xml database (ucd.all.grouped.xml )  in which
  I was able to find IDS, IDC flags ( also OIDS,OIDC, XIDS,XIDC
  of which meaning I'm not entirely sure of)


but I started filtering out to find characters that are NOT
  IDS|IDC 


Something simple like numbers 0x30-0x39 are marked with
  IDS='N' but have no [ OX]IDC flags specified.  Is a lack of
  flag assumed N or Y?  www.unicode.org/reports/tr42/
  documentation on the XML file format doesn't specify.


http://www.unicode.org/reports/tr31/
   I see 'ID_Continue characters include ID_Start characters,
  plus characters '


most languages do support identifiers like a1, a2, etc as
  valid identifiers, so certainly numbers should have IDC even
  though they're not IDS.  
Are there characters that are IDS without being IDC?  There
  are certainly characters that are IDC without IDS.




some examples.

  found  char { cp: '0034',  na: 'DIGIT FOUR',  gc: 'Nd',
 nt: 'De',  nv: '4',  bc: 'EN',  lb: 'NU',  sc: 'Zyyy',
 scx: 'Zyyy',  Alpha: 'N',  Hex: 'Y',  AHex: 'Y',  IDS: 'N',
 XIDS: 'N',  WB: 'NU',  SB: 'NU',  Cased: 'N',  CWCM: 'N',
 InSC: 'Number' }



(this has IDC notation but not IDS; since it says 'digit' I
  assume this is a number type, and should not be IDS.)

  found  char { cp: '0F32',  na: 'TIBETAN DIGIT HALF NINE',
 gc: 'No',  nt: 'Nu',  nv: '17/2',  Alpha: 'N',  IDC: 'N',
 XIDC: 'N',  SB: 'XX',  InSC: 'Number' }



This might be not IDS but is IDC?

  found  char { cp: '203F',
    na: 'UNDERTIE',
    gc: 'Pc',
    IDC: 'Y',
    XIDC: 'Y',
    Pat_Syn: 'N',
    WB: 'EX' }




  this is sort of IDS but not IDC?
  found  char { cp: '309B',  na: 'KATAKANA-HIRAGANA VOICED
SOUND MARK',  gc: 'Sk',  dt: 'com',  dm: '0020 3099',  bc:
'ON',  lb: 'NS',  sc: 'Zyyy',  scx: 'Hira Kana',  Alpha:
'N',  Dia: 'Y',  OIDS: 'Y',  XIDS: 'N',  XIDC: 'N',  WB:
'KA',  SB: 'XX',  NFKC_QC: 'N',  NFKD_QC: 'N',  XO_NFKC:
'Y',  XO_NFKD: 'Y',  CI: 'Y',  CWKCF: 'Y',  NFKC_CF: '0020
3099',  vo: 'Tu' }



  


  



Bengali syllables <... 09BF 09BE> and <... 09BF 09C0>

2017-02-07 Thread Eric Muller
In looking at the wiki{pedia,book.source,tionary} corpus for Bengali, I 
see a relatively large number of syllables with  <... 09BF 09BE> or <... 
09BF 09C0>. I checked a couple of sources, and I did not find them 
listed anywhere as being normally used.


Are they in normal use or are those all typos?

I did not find any occurrence in the Assamese corpus.

Thanks,
Eric.

The syllables (o is the number of occurrences):

[The list of syllables was not preserved in the archive. The surviving occurrence counts are: 54, 93, 171, 238, 79, 75, 157, 125, 118, 58.]






how would you state requirements involving sorting?

2017-01-23 Thread Eric Muller

  
  
Suppose you help somebody write requirements for a piece of software
and you see an item:

Sorting. Diacritic marks need to be stripped when
  sorting titles


You know that sorting is a lot more complicated than removing
diacritics, and that giving the directive above to a naive developer
is going to lead to trouble. You know you want to end up with an
implementation involving the UCA with a tailoring based on the
locale. How would you suggest rewording the requirement?
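
To make the trouble concrete, here is a sketch of what a naive reading of the requirement tends to produce. It assumes Python's standard unicodedata module; the titles are invented, and this is deliberately the implementation to avoid, not the UCA-based one.

```python
import unicodedata


def strip_marks(title):
    """Naive key: decompose, then drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


titles = ["Zola", "étude", "apple", "Ørsted"]

# Plain code point comparison of the stripped keys.
naive_order = sorted(titles, key=strip_marks)
```

The result puts "Zola" before "apple" because uppercase ASCII sorts before lowercase, and "Ørsted" last because Ø has no decomposition and so keeps its high code point. Neither outcome is what a reader of a title list is likely to expect, which is the gap a UCA implementation with a locale tailoring is meant to close.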

Thanks,
Eric.

  



Re: "textels"

2016-09-16 Thread Eric Muller

On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
Quote - Eric Muller <eric.mul...@efele.net> (Fri, 16 Sep 2016, 17:03:54):



On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

(when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).


I'm very interested to know more about those cases.


For our search engine we were unable to use compatibility equivalence 
"out of the box" for splitting the ligature because it also converted 
long s to short s while we wanted to preserve the distinction.


I am interested in the problems with *canonical* equivalence. I thought 
that you were talking about those before.


Compatibility equivalence is a completely different beast. It is, IMHO, 
too coarse a tool and best forgotten. For any particular task, it's 
typically doing too much (e.g. long/short s folding in your case) and 
too little (not everything you need). There was an attempt at improving 
the situation, by providing a whole bunch of fine grained, targeted 
transformations (http://www.unicode.org/reports/tr30/), but that did not 
pan out.
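
For the ligature case specifically, a targeted transformation is easy to sketch. The following assumes Python's standard unicodedata module and uses the presence of "LIGATURE" in the character name as a rough, hypothetical selection criterion; a real search engine would probably want an explicit list.

```python
import unicodedata


def split_ligatures(text):
    """Expand compatibility ligatures one level, leaving long s alone."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        is_ligature = "LIGATURE" in unicodedata.name(ch, "")
        if decomp.startswith("<compat>") and is_ligature:
            # Fields after the tag are the hex code points of the expansion.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


LONG_S_T = "\ufb05"  # LATIN SMALL LIGATURE LONG S T

folded = unicodedata.normalize("NFKC", LONG_S_T)  # long s is lost
split = split_ligatures(LONG_S_T)                 # long s is kept
```

NFKC yields "st", while the targeted version yields long s followed by t, which is the distinction Janusz wanted to preserve.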


Eric.



Thanks,
Eric.



Re: "textels"

2016-09-16 Thread Eric Muller

On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

(when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).


I'm very interested to know more about those cases.

Thanks,
Eric.



Emoji Feminism - The New York Times

2016-03-13 Thread Eric Muller

http://www.nytimes.com/2016/03/13/opinion/sunday/emoji-feminism.html?_r=0


The Chinese Typewriter: The Design and Science of East Asian Information Technology

2016-01-16 Thread Eric Muller

  
  
For those who are in the San Francisco Bay Area:

https://library.stanford.edu/eal


  The
  Chinese Typewriter: The Design and Science of East Asian
  Information Technology


  


  
During the 19th and 20th centuries, groundbreaking
  information technologies like the telegraph, the typewriter,
  and the computer changed the world. All of these technologies
  were designed with the alphabet in mind, however, leaving open
  the question: what about China, Japan, and Korea? In this
  exhibition, the history of modern East Asian information
  technology is explored through artifacts from the personal
  collection of Professor Thomas S. Mullaney (History) and the
  Stanford East Asia Library. Opening Reception and Guest
  Lectures by Jidong Yang (EAL) and Thomas S. Mullaney (History)
  on Wednesday, January 20 at 5pm.
The exhibition is open from January 20, 2016 to September 10,
  2016.
Location: Lathrop East Asia Library
Audience: General Public, Faculty/Staff, Students,
  Alumni/Friends, Members
Sponsor: Stanford University Libraries, History Department,
  Program in History and Philosophy of Science and Technology,
  Center for East Asian Studies, Department of East Asian
  Languages and Cultures
  


  



Re: Proposal for German capital letter "ß"

2015-12-10 Thread Eric Muller

On 12/10/2015 2:45 AM, Frédéric Grosshans wrote:

Le 10/12/2015 05:32, Martin J. Dürst a écrit :
A similar example is the use of accents on upper-case letters in 
French in France where 'officially', upper-case letters are written 
without accents.
Actually, the official body in charge of this (Académie Française) 


They actually mandate "Académie *f*rançaise".  And "Imprimerie 
*n*ationale" (for Philippe; even if imprimerienationale.fr has forgotten 
that).


has always recommended upper-case letters with accents, but the 
school teachers teach the other way, and accents on capital letters 
were technically challenging (in printing, typewriters and keyboards),


Thanks to gallica.fr and archive.org, it is easy to see what actually 
happened until the middle of the 20th century. What I have seen is that 
in both cold and hot metal, until the end of the 19th century, one only 
and always sees É È Ê Ë Ç Œ Æ; on small caps, one can sometimes find À 
Ô Ù. That matches all the descriptions of the "casse parisienne" and 
"police" (how many "a", "b", "c", etc. in a font) I have seen in 
typography manuals. Around the beginning of the 20th century, one starts 
to see books without accented capitals (and unfortunately books with 
inconsistent use of the accented capitals).


Eric.



Toki Pona: A Language With a Hundred Words - The Atlantic

2015-07-28 Thread Eric Muller

http://www.theatlantic.com/technology/archive/2015/07/toki-pona-smallest-language/398363/

Eric.



Re: UDHR in Unicode: 400 translations in text form!

2015-06-29 Thread Eric Muller

On 6/28/2015 12:30 PM, Ken Shirriff wrote:
I don't mean to be critical, but I find the UDHR page is really hard 
to use.





Thanks for the observations. I'll try to find a better organization.

Eric.



Re: UDHR in Unicode: 400 translations in text form!

2015-06-29 Thread Eric Muller

On 6/28/2015 12:20 PM, Philippe Verdy wrote:
Note: The marker icons showing languages in the Leaflet component 
(over the OSM map) are not working (broken links)


Fixed, I believe.


Also the locations assigned to some international languages are strange:

Esperanto ... Picard ... Standard French


The locations for those come from http://glottolog.org. Unless those 
locations are obviously wrong, I'd prefer to keep them aligned.


 But in fact I would have placed those international languages 
somewhere in the middle of an ocean, just aligned vertically in a list 
along a meridian (across the Atlantic or Pacific for example)


A few are already in Antarctica. I'll move Esperanto and Interlingua there.



Some languages do have an ISO 639-3 code. E.g.
- Tetum, official in Timor-Leste, is currently coded as 010 
(mapped to und in ISO 639-3), it should be tet.


In general, identification of the language of the translations is not 
trivial. I have learned to not trust just the names provided with the 
translations.


For this one, there is another translation, [tet], which most likely is 
tet/Tetun. [010] looks like a fairly different language and it is not 
clear to me that it is Tetun. I'd rather have some informed 
recommendation before assigning a language to [010]. It does not help 
that the source site does not seem accessible right now.



- Forro (Saotomense) is a Portuguese-based creole in Sao Tome, 
currently coded as 007 (mapped to und), it should use cri.


The OHCHR site warns "not to confuse Crioulo Santomense with Santomense 
(a variety and dialect of Portuguese in São Tomé and Príncipe)". Again, 
I'd prefer some informed recommendation.




- Kimbundu should also use kmb and not 009
- Umbundo (Umbundu) should also use umb and not 011


According to the Ethnologue, both Kimbundu and Umbundu are used both as 
language names and as family names. Given that I don't really trust the 
sources of those names, I'd prefer some informed recommendation.


Thanks,
Eric.



Re: UDHR in Unicode: 400 translations in text form!

2015-06-29 Thread Eric Muller

On 6/28/2015 10:24 PM, Leo Broukhis wrote:

Ukrainian is in Estonia, Estonian is in the Baltic sea.


I took the locations from glottolog.org. The first error is mine, I 
mistyped a value. The second error comes from Glottolog, I corrected and 
reported to them.


Will appear in the next update.

Thanks,
Eric.



UDHR in Unicode: 400 translations in text form!

2015-06-28 Thread Eric Muller
I am pleased to announce that the UDHR in Unicode project 
(http://unicode.org/udhr) has reached a notable milestone: we now have 
400 translations of the Universal Declaration of Human Rights in text form.


The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu 
de Silva and Sascha Brawer. Many thanks to them and to all the contributors.


There is still plenty of work: most translations would benefit from a 
review, and there are 55 translations for which we have PDFs or images, 
but not yet the text form (look for stage 2 translations).


The site has also been revamped a bit, with a more functional map, and a 
more functional table of the translations. The mappings to ISO 639-3 and 
BCP 47 have been updated to take into account the evolution of those 
standards.


Again, thanks to all the contributors, past, present and future,

Eric.

PS: I believe I have taken care of all the backlog of contributions and 
comments. If I missed something, sorry, and please ping me again.


Re: WORD JOINER vs ZWNBSP

2015-06-26 Thread Eric Muller

  
  
On 6/26/2015 3:48 AM, Marcel Schneider
  wrote:

To do traditional French typography on the
PC,

or anywhere

 a justifying no-break space is needed along
with the colon, because this punctuation must be placed in the
middle between the word it belongs to and the following word.

Actually, it's non-justifying and it's thin. U+202F ‘ ’ NARROW
NO-BREAK SPACE is your friend.
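
As a sketch of how that could be applied to existing text, assuming Python and a hypothetical helper name: replace the ordinary or no-break space typed before a colon with U+202F.

```python
import re

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# One or more ordinary or no-break spaces between a word and a colon.
SPACE_BEFORE_COLON = re.compile(r"(?<=\S)[ \u00a0]+:")


def narrow_space_before_colon(text):
    """Use U+202F for the space that precedes a colon."""
    return SPACE_BEFORE_COLON.sub(NNBSP + ":", text)


fixed = narrow_space_before_colon("Exemple : texte")
```

Colons with no preceding space, as in URLs, are left alone.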

Eric.

  



Help with African characters, please

2015-06-21 Thread Eric Muller
Can you help me identify the characters used in the Kulango, Bouna 
translation of the UDHR?


The text is at 
http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/kou.pdf. Look 
for article 14.


What is the second letter of the word for article (after the N, looks 
like a greek nu), and what is the second letter of the first word (after 
the M, looks similar but different)?


What is the letter that looks somewhat like an epsilon (but compare with 
the epsilon like in articles 13 and 15)?


Thanks,
Eric.



Re: Another take on the English apostrophe in Unicode

2015-06-12 Thread Eric Muller

  
  
On 6/10/2015 9:37 PM, Philippe Verdy
  wrote:


  The French "pomme de terre" ("potato" in English,
French vulgar synonym : "patate") is a single lemma in
dictionaries, but is still 3 separate words (only the first one
takes the plural mark), it is not considered a "nom composé" (so
there's no hyphens).



Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1
Elements of the language, chapter 7 The words, section 3 Formation
of new words, article 2, Composition, very first paragraph (179
overall):

---
By composition, language creates new words, either by
combining simple words with existing words, or by preceding these
simple words  with syllables that have no independent existence: Chou-fleur,
  gendarme, pomme de terre, contredire, désunir, paratonnerre.
  

A word, despite being formed of graphically independent
  elements, is composed as soon as it brings to mind, not
  the distinct images of each of the words from which it is
  composed, but a single image. Thus the composites hôtel
de ville, pomme de terre, arc de triomphe each remind of a
  unique image, and not of the distinct images of hôtel and
  of ville, of pomme and of terre, of arc
  and of triomphe. 
  
---

(hôtel de ville = city hall; pomme = apple, de
= of, terre = earth)

Paragraph 181, 3rd remark:

---
Sometimes the elements composing [the word] are welded in a simple
word: Bonheur, contredire, entracte; sometimes they
are connected by a hyphen: chou-fleur, coffre-fort;
sometimes they stay independent graphically: Moyen âge, pomme de
  terre.
  
  ---
  
(“Le Grévisse” as we affectionately call it, or Le bon usage
  / French Grammar with remarks on today’s French language, is a
must-have for the student of French. It is encyclopedic in its
depth, and has tons of examples and counter-examples. Interestingly,
the French Wikipedia page says “a descriptive grammar of French”,
while the English Wikipedia page says “a prescriptive grammar”; it’s
both!)

I agree that we don’t need a new space coded character. I was just
pointing out that some of the arguments for a new coded character
for the apostrophe in “don’t” apply equally well to the spaces
in the word “pomme de terre”.

Eric.

  



Re: ucd beta, stable filenames

2015-06-05 Thread Eric Muller

On 6/5/2015 8:48 AM, Daniel Bünzli wrote:

Hello,

Would it be possible in the future to publish the latest version of the ucd 
files without the -X.Y.ZdW suffixes under a fixed URI like

   http://www.unicode.org/Public/beta/

and/or simply publish it in the version directory but without the suffixes 
(like the ucdxml files do). With the current scheme it is hard for implementers to 
automate file downloads for testing with the beta.




+1000

Eric.
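
In the meantime, the suffix can be stripped locally after download. A sketch, assuming Python and beta names of the shape UnicodeData-8.0.0d5.txt; the helper name is hypothetical.

```python
import re

# Matches a "-X.Y.ZdW" suffix that sits right before the file extension.
BETA_SUFFIX = re.compile(r"-\d+\.\d+\.\d+d\d+(?=\.[A-Za-z]+$)")


def stable_name(filename):
    """Drop the beta version suffix, if any, from a UCD file name."""
    return BETA_SUFFIX.sub("", filename)


example = stable_name("UnicodeData-8.0.0d5.txt")
```

This only helps once the file has been found; discovering the current suffix still requires listing the directory, which is the part a fixed URI would remove.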



Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Eric Muller

On 6/5/2015 10:29 AM, John D. Burger wrote:

Linguistically, don't and friends pass all the diagnostics that indicate 
they're single words.


If I am not mistaken, the French pomme de terre also passes the 
diagnostics. So we need a new space character.


Eric.



Re: Tag characters

2015-05-26 Thread Eric Muller

  
  
On 5/21/2015 1:25 PM, Asmus Freytag (t)
  wrote:


  
  On 5/21/2015 8:46 AM, Peter Constable
wrote:
  
  




  Would

  Unicode really want to get into the business of running a
  UFL service?

  
  
  I suspect both Eric and I may have been slightly
  tongue-in-cheek with respect to UFLs...


Actually, I was serious.

Eric.

  



Re: Tag characters

2015-05-20 Thread Eric Muller

On 5/20/2015 7:11 PM, Doug Ewell wrote:
In any event, URLs that point to images would be an awful basis for an 
encoding.


I would make an exception for the URL 
http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html.


Eric.




Re: Usage stats?

2015-03-27 Thread Eric Muller
Would a corpus like Wikipedia or Project Gutenberg be appropriate for 
your purpose? Both are freely and easily accessible: 
http://dumps.wikimedia.org/backup-index.html and 
http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog.


Eric.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Séminaire doctoral Chemins des écritures | Gripic

2014-12-17 Thread Eric Muller

This seminar may be of interest to those in France.

http://www.gripic.fr/evenement/seminaire-doctoral-chemins-ecritures

Eric.



Re: fonts for U7.0 scripts

2014-10-23 Thread Eric Muller



How about even having just the glyphs in the Unicode.org charts being in the 
public domain?


Very easy to achieve:

1. Ask the owner of the font how much money he wants to part with his 
property.

2. Write a check for the corresponding amount.
3. You are now the owner, you can put the font in the public domain.

Eric.



Re: Help with Hebrew

2014-07-26 Thread Eric Muller

Many thanks for all the answers on my Hebrew and Arabic questions.

On 7/6/2014 4:18 AM, Matitiahu Allouche wrote:

The original text is interesting, combining French, Latin and Hebrew.


There is also a fair amount of Greek, and a couple of Arabic words.


Unfortunately, the author and/or the type setter were not quite proficient in 
Hebrew, so that the Hebrew words in the 3 referenced pages contain quite a few 
errors.


I think it's a safe assumption that the typesetter was not necessarily 
fluent in Hebrew.


  
I am not sure if the digitization should reproduce faithfully the flaws of the original document, or if it is an opportunity to correct the errors (which may not be possible for the first page).


I want both! In my XML source, I do record things like <correction 
original="mistaque">mistake</correction>, and render that in the EPUBs 
I produce by "mistake [mistaque]".
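
As an illustration of that rendering, here is a sketch in Python. Only the correction element and its original attribute come from the description above; the enclosing p element and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def render(elem):
    """Flatten an element to text, showing corrections as 'fixed [original]'."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "correction":
            parts.append(f"{render(child)} [{child.get('original')}]")
        else:
            parts.append(render(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


source = '<p>A small <correction original="mistaque">mistake</correction> here.</p>'
rendered = render(ET.fromstring(source))
```

Keeping both readings in the source means a different rendering, for example one showing only what is on the page, is a matter of changing the one branch.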



1) Eric's representation of the Hebrew words in f274.image seems correct. So 
the Unicode sequences are
Yod (U+05D9) Segol (U+05B6) Dalet (U+05D3) Segol (U+05B6) Alef (U+05D0)
And
Yod (U+05D9) Segol (U+05B6) Dalet (U+05D3) Segol (U+05B6) He (U+05D4)
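
A sequence like these can be double-checked by printing the character names. A sketch, assuming Python's standard unicodedata module:

```python
import unicodedata

# Yod, Segol, Dalet, Segol, Alef, as listed for the first word.
first_word = "\u05d9\u05b6\u05d3\u05b6\u05d0"

names = [unicodedata.name(ch) for ch in first_word]

# The sequence is already in NFC, so normalization leaves it untouched.
is_stable = unicodedata.normalize("NFC", first_word) == first_word
```

The same check shows that U+05B6 is HEBREW POINT SEGOL and U+05B8 is HEBREW POINT QAMATS.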

However, the Hebrew words are suspect:
a. The first one (Yod Dalet Alef) is not a stem known in Hebrew. It could be a 
deformation of the stem Yod Resh Alef whose meaning is to fear (= the French 
craindre).


I would not be surprised if the typesetter confused dalet and resh. The 
good news is that the text I pointed to is one of the many re-editions 
of the work, and we have a facsimile of the original edition:


http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f26.image

Here it seems clear that it's a resh in both examples. By the way, the 
whole sentence reads roughly In Hebrew, there are words which are 
different only in that one ends with an aleph, and the other with a he, 
which are not pronounced, as  which means fear and  which means 
throw away. This follows a discussion that in French, champ and 
chant are pronounced the same, with the final p and t silent.



b. Both grammatical forms (with Segol under the rightmost two letters in both 
words) do not conform to proper conjugation, as far as I know (conjugation of 
Hebrew verbs is not a matter for the faint of heart).


The original edition seems to show a qamats. Would that be better?



2) The case of f299.image is yet more complicated:


The original edition:

http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f53.image


a. If you compare the rightmost letter in the Hebrew word following mais dans with the 
corresponding letter in the Hebrew word following pour, you can see that they don't 
look identical. The first one has a rounded top-right corner while the second one has a more square 
shape. The first letter looks like a Hebrew letter Resh (U+05E8) and the second looks like a Hebrew 
letter Dalet (U+05D3, and it is the correct one).


The original seems to show Dalet in all three cases. Overall, what I see 
there is


- dalet, sheva, bet, patah, resh, space, shin, segol, qof, segol, resh
- shin, segol, qof, segol, resh
- dalet, sheva, bet, patah, resh
- dalet, qamats, bet, qamats, resh

The text is telling how the genitive is marked differently in Latin and 
in Hebrew. In Latin, in verbum falsitatis, it's falsitas that has been 
transformed into falsitatis to mark the genitive, while in Hebrew, it's 
(the word for verbum) that is modified.







c. When a word starts with Dalet, there should generally be a Dagesh in the 
Dalet.


That brings an interesting question. If you look at the French in the 
two editions (1660 and 1810), you will see that they use different 
orthographies, and that today's orthography (2014) is yet another one. 
There is no reason this would not happen in the same way for the Hebrew. 
So what I am really after is

- what's on the page
- what was meant to be on the page, when the editions were made (1660, 
1810)
- what one would want to put on the page if one were to make a modern 
edition, with modern orthography throughout


Is it plausible that the dagesh would only be in the last case (modern 
orthography), since it's clearly absent in both facsimiles?

d. The point on the Shin (rightmost letter of the second word) is a Sin Dot, 
while it should be a Shin Dot,


None in the original edition, apparently.


The expression was probably quoted from Exodus XXIII, 7, where the vowel under 
the Bet is a Patah, which is also the way it would be written in modern Hebrew.
So the right sequences (after correcting the errors in the original document) 
are
- Dalet (U+05D3) Dagesh (U+05BC) Sheva (U+05B0) Bet (U+05D1) Patah (U+05B7) 
Resh (U+05E8) Space Shin (U+05E9) Shin Dot (U+05C1) Qamats (U+05B8) Qof 
(U+05E7) Segol (U+05B6) Resh (U+05E8)
- Shin (U+05E9) Shin Dot (U+05C1) Qamats (U+05B8) Qof (U+05E7) Segol 
(U+05B6) Resh (U+05E8)
- Dalet (U+05D3) Dagesh (U+05BC) Sheva (U+05B0) Bet (U+05D1) Patah (U+05B7) 
Resh (U+05E8)
- Dalet (U+05D3) Dagesh (U+05BC) Qamats (U+05B8) Bet (U+05D1) 

Parsers for the UnicodeSet notation?

2014-07-23 Thread Eric Muller
I would like to work with the exemplarCharacters data in the CLDR. That 
uses the UnicodeSet notation. Is there somewhere a parser for that 
notation, that would return me just the list of characters in the set? 
Something a bit like the UnicodeSet utility at 
http://unicode.org/cldr/utility/list-unicodeset.jsp, but for use in 
apps/shell.


I suspect that the exemplarCharacters use a restricted form of the 
UnicodeSet notation (e.g. do not use property values). Is that correct, 
and if so, what's the subset?
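
To show the kind of thing I have in mind, here is a sketch of a parser for a guessed subset. The subset is an assumption, not the CLDR definition: single characters, a-z style ranges, {multi-character strings}, \uXXXX escapes and backslash-escaped literals, with whitespace ignored and no properties, nested sets or set operations.

```python
import re

TOKEN = re.compile(
    r"\\u([0-9A-Fa-f]{4})"   # group 1: \uXXXX escape
    r"|\\(.)"                # group 2: backslash-escaped literal
    r"|\{([^}]*)\}"          # group 3: {multi-character string}
    r"|(\S)"                 # group 4: any other non-space character
)


def parse_simple_unicodeset(pattern):
    """Return the list of items in a restricted [...] UnicodeSet pattern."""
    pattern = pattern.strip()
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a pattern of the form [...]")
    items = []
    range_open = False
    for match in TOKEN.finditer(pattern[1:-1]):
        hex4, escaped, string, literal = match.groups()
        # An unescaped '-' after a single character starts a range.
        if literal == "-" and items and len(items[-1]) == 1 and not range_open:
            range_open = True
            continue
        if hex4 is not None:
            item = chr(int(hex4, 16))
        elif escaped is not None:
            item = escaped
        elif string is not None:
            item = string.replace(" ", "")
        else:
            item = literal
        if range_open:
            if len(item) != 1:
                raise ValueError("a range must end with a single character")
            start = items.pop()
            items.extend(chr(cp) for cp in range(ord(start), ord(item) + 1))
            range_open = False
        else:
            items.append(item)
    if range_open:
        items.append("-")  # a trailing '-' is taken literally
    return items


example = parse_simple_unicodeset("[a-c é {ch} \\u05D0 \\-]")
```

Note how much depends on whether a hyphen is escaped: an unescaped one between two characters silently turns into a range.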


Incidentally, I copy/pasted the punctuation exemplar characters for 
he.xml into the utility, and it reported that the set contains 8,130 
code points, including the ascii letters. Somehow, that seems incorrect. 
What did I do wrong?


Thanks,
Eric.




Help with arabic

2014-07-05 Thread Eric Muller
I am working on the digitization of a text that includes Arabic; could 
somebody please tell me what is the Unicode representation of the 
(short) fragments on those two pages?


http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f33.image

http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image

Thanks,
Eric.



Time to learn French!

2014-05-08 Thread Eric Muller

http://www.forbes.com/sites/pascalemmanuelgobry/2014/03/21/want-to-know-the-language-of-the-future-the-data-suggests-it-could-be-french/

http://www.france24.com/en/20140326-will-french-be-world-most-spoken-language-2050/

http://www.boston.com/bostonglobe/ideas/brainiac/2014/03/the_language_of_1.html

Et cetera.

Eric.




Re: Editing Sinhala and Similar Scripts

2014-03-19 Thread Eric Muller

On 3/19/2014 7:57 AM, Peter Constable wrote:
It is nonsensical to talk about erasing a _keystroke_. 


undo, revert the effect of a keystroke. The concept is meaningful.

Eric.



Transforming BidiTest.txt to the format of BidiCharacterTest.txt

2014-02-12 Thread Eric Muller
Does anybody have a program that transforms the UCD file BidiTest.txt to 
the format of BidiCharacterTest.txt, and that they are willing to share?


Thanks,
Eric.



Representation of neutral tone in pinyin and bopomofo

2013-11-13 Thread Eric Muller

Is it correct that:

- in bopomofo, the neutral (or light) tone is represented by U+02D9 ˙ 
DOT ABOVE, and in the text representation, that character follows the 
bopomofo characters of the syllable (just like all the other characters 
for tones)


- in pinyin, the neutral tone is typically not marked, but it may be 
marked. When that's the case, U+02D9 ˙ DOT ABOVE is used.



When U+02D9 is used in pinyin, where is it in the character sequence? 
Before the syllable to which it applies (where it is displayed) or after 
(like in bopomofo)?


When U+02D9 is used in bopomofo, it needs to be displayed before the 
syllable. Is the display position simply before the nearest preceding 
character from the set {U+3105 ㄅ BOPOMOFO LETTER B ... U+3119 ㄙ 
BOPOMOFO LETTER S, U+31A0 ㆠ BOPOMOFO LETTER BU ... U+31A3 ㆣ BOPOMOFO 
LETTER GU}?
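
To make the candidate rule concrete, here is a sketch of it in Python. It implements only the guess stated in the question, not a confirmed rendering rule, and it does nothing useful for a syllable that has no character from that set.

```python
DOT_ABOVE = "\u02d9"

# The set from the question: U+3105..U+3119 and U+31A0..U+31A3.
INITIALS = set(range(0x3105, 0x311A)) | set(range(0x31A0, 0x31A4))


def display_order(text):
    """Move each U+02D9 before the nearest preceding initial (candidate rule)."""
    out = []
    for ch in text:
        if ch != DOT_ABOVE:
            out.append(ch)
            continue
        # Walk back to the nearest preceding character from the set.
        pos = len(out) - 1
        while pos >= 0 and ord(out[pos]) not in INITIALS:
            pos -= 1
        if pos >= 0:
            out.insert(pos, ch)
        else:
            out.append(ch)  # no initial found: leave the dot in place
    return "".join(out)


# U+3107 (M), U+311A (A), then the neutral tone mark in logical order.
shown = display_order("\u3107\u311a\u02d9")
```

With this rule the stored order M, A, dot is displayed as dot, M, A.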


Thanks,
Eric.




Re: Can the combining diacritical marks combine with any base character?

2013-02-12 Thread Eric Muller

On 2/11/2013 12:49 AM, Richard Wordingham wrote:


The problem sequence is U+003E GREATER-THAN SIGN, U+0338 COMBINING LONG
SOLIDUS OVERLAY which is canonically equivalent to U+226F NOT
GREATER-THAN.


Which demonstrates: NFC applied to the serialization of an XML infoset 
is not the same as NFC applied to the text nodes and attributes of that 
infoset.
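
A short sketch of the difference, assuming Python's standard unicodedata module; the element name is made up.

```python
import unicodedata

OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

# A text node that starts with the combining overlay, then its serialization.
text_node = OVERLAY + "x"
serialized = "<a>" + text_node + "</a>"

# Normalizing the text node alone changes nothing.
node_nfc = unicodedata.normalize("NFC", text_node)

# Normalizing the serialization composes the tag's '>' with the overlay.
serialized_nfc = unicodedata.normalize("NFC", serialized)
```

In the second case the closing ">" of the start tag is replaced by U+226F, so the result is no longer the same document, and not even well-formed.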



The short answer is that XML shall not do canonical
equivalence, at least, not on data; so doing would corrupt some of the
CLDR definitions,


That case is different: it's whether a use of text strings (CLDR in this 
case) can be indifferent to normalization. There are other cases, e.g. 
the regular expressions to validate some of Unihan's properties, which 
should not be normalized, and which assume that the data to be validated 
is in NFD.


Eric.




Re: Case-folding dotted i

2013-01-29 Thread Eric Muller

On 1/24/2013 2:15 AM, Richard Wordingham wrote:

If text is going to be processed, i+dot is wrong for Turkish, as the
Unicode casing rules for Turkish will capitalise it to I+dot+dot,
which should display with two dots. If you're going to use an explicit
dot, I'd have said U+0131, U+0307 would be better, though I still
think using an explicit dot is wrong in general. Richard.


Six abstract characters (hard dotted, dotless, soft dotted in 2 cases) 
for four coded characters, something has to break somewhere.


With the current practice, there is inherent ambiguity.

The current practice is tolerable only in the presence of locale 
information. In which case the addition of combining dots in case 
transformation is useless, and in fact harmful as Richard showed.


Adding more characters (as in creating the hard dotted form by using the 
dotless + combining dot) breaks current practice.
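
The double dot can be reproduced with a small sketch. It assumes Python, whose built-in case mappings are locale-independent; turkish_upper is a hypothetical helper that applies only the two Turkish-specific mappings before the default ones.

```python
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
CAPITAL_I_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
COMBINING_DOT = "\u0307"  # COMBINING DOT ABOVE


def turkish_upper(text):
    """Uppercase with the Turkish mappings i -> U+0130 and U+0131 -> I."""
    return text.replace("i", CAPITAL_I_DOT).replace(DOTLESS_I, "I").upper()


# Default lowercasing of U+0130 adds a combining dot.
lowered = CAPITAL_I_DOT.lower()

# Turkish uppercasing of that result carries the extra dot along.
raised = turkish_upper(lowered)
```

The round trip gives U+0130 followed by U+0307, that is, two dots. With locale information available, the combining dot added by the default lowercasing serves no purpose.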



Eric.




Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-11 Thread Eric Muller

On 7/11/2012 9:20 AM, Julian Bradfield wrote:
Unicode is about plain text. TeX is about fine typesetting. 


Too narrowly defined: Unicode.

I think Unicode is not just for plain text, but rather concerns itself 
with only the lower layer of /any/ text system.


When it's plain text, Unicode has the burden of solving all the 
problems. When it's a richer system, there is the issue of cooperation 
between the layers, a situation that Unicode cannot ignore.


Eric.



Record-A-Thon is tomorrow – help record 50 languages in a single day

2011-07-29 Thread Eric Muller



From http://blog.mightyverse.com/2011/06/300-languages-record-a-thon/


   On July 30th, 2011 we will meet at the Internet Archive in San
   Francisco, where volunteers will record the Universal Declaration of
   Human Rights http://www.un.org/en/documents/udhr/index.shtml
   (UDHR) in their native language(s). Mightyverse volunteers will
   assist recording at several recording stations. Each station will be
   equipped with a video camera, monitor, lighting, microphone and
   Mightyverse PhraseFarm teleprompter system to enable the capture of
   spoken language. These high quality recordings of native speakers
   will be made available at archive.org http://archive.org/ under a
   Creative Commons license.
   Mightyverse is excited to support the Long Now Foundation
   http://longnow.org/’s 300 languages project in its July 30th 2011
   record-a-thon http://rosettaproject.org/record-a-thon/. The goal
   of the 300 languages project is to record spoken language that has
   parallel translations in at least 300 languages. Towards that
   effort, Laura Welcher and her team at The Rosetta Project
   http://rosettaproject.org/ (an ongoing effort by The Long Now
   Foundation) have identified texts that already exist in parallel
   translations. Of those texts, we at Mightyverse were especially
   excited by the UDHR.



Signup for UDHR Recording: 
https://spreadsheets.google.com/spreadsheet/viewform?formkey=dEM2cW9wSm4za0VmSHZwTEI2amxhNUE6MQ 



Also, it will be a fun day with free form language recording, some 
speakers at the beginning of the day and at lunch and there'll be food 
and prizes for people who record.


Eric.



Re: OpenType update for Unicode 5.2/6.0?

2010-10-15 Thread Eric Muller

 I entirely second Peter's description.

Let’s keep this in perspective: consider just how much progress there 
has been in the last ten years. 


IMHO, we can all be grateful to Microsoft in that area. I don't believe 
any other company or group has been as instrumental in bringing real 
solutions to wide audiences.


Eric.



Re: Derived age regexp

2010-10-15 Thread Eric Muller

 On 10/15/2010 3:19 PM, Tim Greenwood wrote:
Is there any regular expression - in perl, or elsewhere, that enables 
searching on the derived age? I want to find all characters in a file 
added since Unicode 4.1.
I could write it all by processing against the derived age file, but 
it would be nice if it is ready to go.




XQuery on the XML representation of the UCD is your friend. E.g.

---
declare namespace u = "http://www.unicode.org/ns/2003/ucd/1.0";

for $c in doc('ucd.all.flat.xml')//u:ucd/u:repertoire/u:char[@age = "4.1"]
return concat ($c/@cp, " ", $c/@age, " ", $c/@na, "&#xa;")
---
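
For those without an XQuery processor at hand, the same idea can be sketched in Python against the DerivedAge.txt file. The `SAMPLE` lines below only imitate that file's layout and are there to keep the sketch self-contained (check the real file for authoritative values); `parse_ages` and `added_since` are made-up helper names, and "since" is taken here to mean "in a later version than":

```python
import re

# Illustrative lines in the layout of DerivedAge.txt (not the real file).
SAMPLE = """\
0000..01F5    ; 1.1 # [502] ...
0237..0241    ; 4.1 #  [11] ...
0242..024F    ; 5.0 #  [14] ...
"""

LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\d+)\.(\d+)")

def parse_ages(text):
    """Return a list of (first, last, (major, minor)) tuples."""
    ranges = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            ranges.append((first, last, (int(m.group(3)), int(m.group(4)))))
    return ranges

def added_since(text, ranges, version):
    """Characters of text whose age is strictly later than version."""
    return [ch for ch in text
            if any(lo <= ord(ch) <= hi and age > version
                   for lo, hi, age in ranges)]

ranges = parse_ages(SAMPLE)
found = added_since("a\u0237\u0242", ranges, (4, 1))
print([f"U+{ord(ch):04X}" for ch in found])
```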

Eric.




Re: Bengali Script

2010-07-12 Thread Eric Muller

On 7/8/2010 5:09 PM, Tulasi wrote:

Ok I am correcting - Bangladeshi to Bengali.
   


The Government of West Bengal / Society for Natural Language Technology 
Research (a member of the Consortium) has a very strong preference for 
the term Bangla rather than Bengali.


Eric.




Re: Titlecasing iota subscript

2010-06-03 Thread Eric Muller

See also the FAQ, http://www.unicode.org/faq/greek.html#6

Eric.




NYT article: Using a New Language in Africa to Save Dying Ones

2004-11-13 Thread Eric Muller
http://www.nytimes.com/2004/11/12/international/africa/12africa.html?ex=1101365144ei=1en=b4b60fe9706acc9b 

Eric.



Re: [africa] Unicode IDNs

2004-11-09 Thread Eric Muller




Works for me by clicking on the link in Chris's message.
Mozilla 1.7
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616.
Running on some configuration of XP SP2. The navigation bar shows
"http://www..net".

Eric.





Re: Looking for a C library that converts UTF-8 strings from their decomposed to pre-composed form

2004-11-08 Thread Eric Muller
Deborah Goldsmith wrote:
It's worth pointing out that there is no such thing as "precomposed" 
Unicode. Normalization form C (NFC) could be called "as precomposed 
as possible". There are some sequences of Unicode that can only be 
expressed using combining marks.

As well as single ("precomposed") characters which have a sequence of more 
than one character as their NFC form. So NFC is not even "as precomposed 
as possible".

Eric.



Re: Looking for the UDHR in Thai

2004-11-03 Thread Eric Muller
Ed, thanks for the pointers. I'll be in touch with you off-list.
In the meantime, I have Hindi, Sanskrit, Magahi and Bhojpuri versions 
at http://www.rawbw.com/~emuller/unicode/index.html.

Eric.



Looking for the UDHR in Thai

2004-11-02 Thread Eric Muller
I am using the various versions of the Universal Declaration of Human 
Rights at http://www.unhchr.ch/udhr/index.htm as test material. 
Unfortunately, the Thai version is an image, and the resolution is not 
good enough for me to even attempt to retype the document. Can somebody 
point me to either better images, or even better to a text version (any 
encoding, with or without markup)?

The only thing I have found so far is 
http://bkk2.loxinfo.co.th/~aithnd/tudhr.html, but this does not seem 
to be the complete text: the preamble is missing, and the text is much, 
much smaller than any of the other languages.

Thanks,
Eric.



Re: CYRILLIC CAPITAL/SMALL LETTER PE WITH DESCENDER

2004-10-11 Thread Eric Muller
I have found a couple of books on Abkhaz, and I have scanned some pages. 
They are at http://www.unicode.org/~emuller/abkhaz. Page 88 of the 
second book is a reproduction of a third book, which seems very 
interesting, but that I have not been able to locate.

Eric.



CYRILLIC CAPITAL/SMALL LETTER PE WITH DESCENDER

2004-09-28 Thread Eric Muller
It seems that Abkhaz, written in Cyrillic, uses a PE WITH DESCENDER, but 
I can't find this case pair in Unicode. Am I missing something, or do we 
need to encode those?

Evidence:
- Daniels and Bright, p717, table 60.15, right column, 6th entry.
- Universal Declaration of Human Rights, at 
http://www.unhchr.ch/udhr/lang/abk.htm

By the way, the UDHR document uses its own character set. I have 
attached a tentative TEC map (for SIL's TECkit tool) to convert to 
Unicode. The PE with descender are converted to U+FFFD followed by 1 
for the uppercase and 2 for the lowercase.

Thanks,
Eric.
EncodingName    Abkhaz
DescriptiveName Abkhaz

Version 1.0
Contact mailto:[EMAIL PROTECTED]

pass (Unicode)

0x32  0x2116

0x25  0x40f
0x35  0x45f

0x29  0x4ac
0x30  0x4ad

0x2a  0xfffd 0x31
0x38  0xfffd 0x32

0x2b  0x4be
0x3d  0x4bf

0x3a  0x49a
0x36  0x49b

0x3f  0x4b4
0x37  0x4b5

0x401  0x40e
0x451  0x4e1

0x419  0x49e
0x439  0x49f

0x429  0x4b2
0x449  0x4b3

0x42a  0x4d8
0x44a  0x4d9

0x42d  0x4bc
0x44d  0x4bd

0x42e  0x4a8
0x44e  0x4a9

0x42f  0x4f6
0x44f  0x4f7

0x2116  0x4b6
0x33  0x4b7

0x402  0x30
0x403  0x31
0x404  0x32
0x405  0x33
0x406  0x34
0x407  0x35
0x408  0x36
0x409  0x37
0x40a  0x38
0x40b  0x39



Re: CYRILLIC CAPITAL/SMALL LETTER PE WITH DESCENDER

2004-09-28 Thread Eric Muller
Michael Everson wrote:
At 08:12 -0700 2004-09-28, Eric Muller wrote:
It seems that Abkhaz, written in Cyrillic, uses a PE WITH DESCENDER, 
but I can't find this case pair in Unicode. I am missing something, 
or do we need to encode those?

U+04A6, U+04A7 are used in Abkhaz for that sound, I believe.

Isn't it problematic to have the distinction between (MIDDLE) HOOK and 
DESCENDER for GHE (494/4F6), KA (4C3/49A) and arguably EN (4C7/4C9) but 
not for PE?

That being said, I am not trying to beat the master of disunification 
8-) If we agree that 4A6/7 is it, then we need at least an annotation 
"can be rendered with a descender instead of a hook", or maybe go all 
the way for a change of the representative glyph to use a descender, 
since that is the form used in both D&B and in the Abkhaz font.

Thanks,
Eric.



Re: valid characters in user names- esp. compatibility characters

2004-08-14 Thread Eric Muller
Tex Texin wrote:
However, I am curious as to whether some Users might read/write their names
using compatibility characters (esp. in ideographic markets) and object to the
characters being normalized through nfkc. 

There is a further problem there, because the CJK compatibility 
characters have a *canonical* decomposition. The UTC is working on some 
scheme for Ideographic Variation Sequences, and the intent is to use 
that to solve the canonical equivalence problem.
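
The problem is easy to observe with Python's standard `unicodedata` module (used here simply as a normalizer): U+F900 has a canonical singleton decomposition to U+8C48, so every normalization form, not just NFKC, replaces it.

```python
import unicodedata

compat = "\uf900"  # CJK COMPATIBILITY IDEOGRAPH-F900

# All four forms map it to the unified ideograph U+8C48.
results = {form: unicodedata.normalize(form, compat)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, value in results.items():
    print(form, f"U+{ord(value):04X}")
```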

Eric.



Re: Writing Tatar using the Latin script; new characters to encode?

2004-07-27 Thread Eric Muller
Mark E. Shoulson wrote:
Unicode exists to support what people use.  Do people use Latin script 
for Tatar?  Evidence indicates that they do.  Should Unicode support 
it, then?  Certainly.  Does Unicode support it?  Yes, Unicode supports 
the Latin script, with gobs of extensions.  So what's the problem?

Latin n with descender, which is not encoded but needed according to
http://www.eki.ee/letter/chardata.cgi?lang=tt+Tatarscript=latin.

Eric.



Re: Proposal to encode dominoes and other game symbols

2004-05-25 Thread Eric Muller

Philippe Verdy wrote:
A suggestion for playing cards: why not including the Tarots?
I mean in French the 4 Cavaliers figures, the 18 Atouts, and the Excuse
(which is not exactly a Joker); sorry I don't have their English names.
Make that 21 atouts (labeled 1 through 21), for a total of 78 cards. 
The cavalier is between the jack and the queen. Very popular game in 
high school and college in my days.

Eric.



Question on CLDR number patterns

2004-05-25 Thread Eric Muller




The decimal pattern for Arabic/Kuwait contains U+0660  ARABIC-INDIC
DIGIT ZERO, apparently for the MinimumInteger part (using the Java
DecimalFormat terminology), presumably to select the set of Arabic
digits. However, this mechanism does not seem to be part of the Java
patterns, so I suspect it was added by CLDR. But the best description I
have been able to find is in UTR #35:

The numbers element supplies information for formatting and
parsing
numbers and currencies. It has three sub-elements: symbols,
numbers, and currencies. The data is based on the
Java/ICU format. The currency IDs are from [ISO4217]. For
more information, including the pattern structure, see [JavaNumbers].

(The last pointer goes to the J2SE 1.4.1 documentation, and Sun says
"Products listed on this page have completed the Sun End of
Life process. ", btw)

So where can I find the documentation on the use of something other
than U+0030 0 DIGIT ZERO in a CLDR number pattern?

Thanks,
Eric.





Re: Question on CLDR number patterns

2004-05-25 Thread Eric Muller






Mark Davis wrote:

  
  
  
  
  
  
  
  The decimal format looks like the following:
  
  #,##0.###;#,##0.###-

I was actually looking the locales through the ICU explorer, which
apparently replaces the localizable characters by those specified in
the symbols, hence my confusion. 


  
   (We should add documentation in the future so
that we don't depend on anything from Sun.)

Or anybody else, and not just because the links are not stable.

  
  BTW, it would probably be better to float your
questions on the CLDR mailing list.

Will do. That's the old unicore vs. unicode, but it's now cldr vs.
unicode.

Thanks,
Eric.





Re: ISO 15924 French name Gotique: a typo...???

2004-05-21 Thread Eric Muller

Michael Everson wrote:
Collins-Robert Senior Dictionnaire Français-Anglais Anglais-Français
gothique [architecture, style] Gothic. écriture ~ Gothic script
That means Fraktur
gotique [ling] Gothic
That means Wulfilan
Stet. 

Le Petit Robert (1987) concurs with your assessment:
---
GOTIQUE. Voir GOTHIQUE 3
GOTHIQUE 3 écriture gothique, écriture à caractères droits, ...  Ling. 
n. m. Le gothique ou (plus souvent) GOTIQUE, langue des Goths, rameau 
oriental des langues germaniques.

---
So does Vendryes in Fossey: the relevant chapter title is Écriture 
gotique.

Eric.




Writing Tatar using the Latin script; new characters to encode?

2004-05-11 Thread Eric Muller
According to www.eki.ee, there is currently an effort to convert the 
writing of Tatar from Cyrillic to Latin.

1. Does somebody have more information about that effort?

Eki lists four characters as needed but missing in Unicode (see 
http://www.eki.ee/letter/chardata.cgi?lang=tt+Tatarscript=latin).

2. The case pair for barred o is encoded (U+019F and U+0275), and it 
seems that their confusion comes from less-than-perfect but annotated 
name for U+019F, and from the usage remark African. Can we 
authoritatively tell them that those two characters are the ones they 
want? Can we add a Tatar usage remark to both?

3. The case pair n with descender is definitely not encoded, and from my 
memory of the discussion of ghe with descender, we would want to encode 
them as separate characters (rather than with combining descenders on 
n). Is anybody working on that proposal?

Thanks,
Eric.
PS: sorry for the double post to unicode and unicore. However, given the 
current state of [EMAIL PROTECTED], this seems the best course of action.





Re: GB18030 and super font

2004-04-22 Thread Eric Muller






Raymond Mercier wrote:

  
  
  
  
But that link to proofing tools leads nowhere. Maybe it's not so
easy to get the CHS version.
  
  

http://www.amazon.com/exec/obidos/tg/detail/-/BBZ54P/qid=1082651762/sr=8-1/ref=pd_ka_1/103-8333725-5907026?v=glances=softwaren=507846

Includes ~140 fonts, mostly for CJK, Arabic, Hebrew but other scripts
as well. Includes "Simsun (Founder Extended)" aka "-", with
65,531 glyphs!

Eric.





Re: GB18030 and super font

2004-04-22 Thread Eric Muller






Raymond Mercier wrote:

  
  
  
  Mark Shoulson writes
  
their Super Font is bundled with Microsoft Office XP, and
 even Microsoft's prices haven't gotten that high!
  
From Microsoft,
  
  http://www.microsoft.com/globaldev/DrIntl/columns/015/default.mspx :
  
"A font that contains Simplified Chinese glyphs from both CJK Extension
A
and B sets is "SimSun (Founder Extended)" (SurSong.ttf in the system), 

The following ideographic characters are apparently mapped: all of the
URO, all 12 unified ideographs from the CJK Compatibility Ideographs
block, all of Ext A, 36,862 of Ext B (out of 42,711), some of the
compatibility ideographs of the CJK Compatibility Ideographs block,
none of the CJK Compatibility Supplement.

Eric.





Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-04-04 Thread Eric Muller


Kenneth Whistler wrote:

Uh, no. ZWNBSP, SPACE, ZWNBSP is equivalent to NBSP. 

I suspect that equivalent is only for some aspects.

In particular, NBSP has a bidi category of CS, which means that A 
0<NBSP>7 B (in bidi notation) displays as B 0 7 A, while A 0<ZWNBSP, 
SPACE, ZWNBSP>7 B displays as B 7 0 A.
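
The property values behind this can be checked with Python's standard `unicodedata` module; this only shows the bidi classes involved and does not run the bidi algorithm itself.

```python
import unicodedata

bidi_class = {
    "NBSP": unicodedata.bidirectional("\u00a0"),    # NO-BREAK SPACE
    "SPACE": unicodedata.bidirectional("\u0020"),   # SPACE
    "ZWNBSP": unicodedata.bidirectional("\ufeff"),  # ZERO WIDTH NO-BREAK SPACE
    "DIGIT": unicodedata.bidirectional("0"),        # a European digit
}

# NBSP is a Common Separator, which lets it join the digits on either
# side; SPACE is whitespace and ZWNBSP is a boundary neutral.
for name, value in bidi_class.items():
    print(name, value)
```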

Eric.





Re: Converting between Shift-JIS and Unicode

2004-04-01 Thread Eric Muller


Rick Cameron wrote:

Could you please point me to information on the relationship between JIS X
0208-1990 (as represented by the kJis0 field in Unihan.txt) and Shift-JIS?

The JIS X 0208 and JIS X 0213 standards include a description (in Japanese, but 
with pictures) of the relationship. They are available on line:  go to 
http://www.jisc.go.jp/; click on the blue button labeled JIS in the 
Search section; enter JIS X 0213 in the third input box; click on the 
attached button; select JIS X 0213 in the resulting list. The first PDF 
is about the changes in :2004, the others are actually :2000. Similarly 
for 0208.

You can also consult Ken Lunde's excellent book:  CJKV Information 
Processing, ISBN 1-56592-224-7 
(http://www.oreilly.com/catalog/cjkvinfo/index.html).

If you intend to do almost any work on CJK stuff, you need those two 
sources, and some more.

Eric.





Re: LATIN SMALL LIGATURE CT

2004-02-27 Thread Eric Muller


Peter Constable wrote:

Adobe included this and other ligatures in their use of the PUA for
their own legacy reasons;
More specifically, to allow applications which do not have OpenType 
layout to display those ligatures.

it has otherwise never been necessary for them
to do so in their Pro fonts, 

Indeed, InDesign does not use those code points.

and I believe they are moving away from
that practice.
Not until the majority of applications support OpenType layout, and, 
more importantly, the majority of documents have been migrated to not 
use those PUA characters. In other words, probably never for existing 
fonts.

Eric.





Re: collation of small capitals

2004-01-31 Thread Eric Muller


Philippe Verdy wrote:

The most common use I have seen of small capitals is as a font style, where
they were used to represent lowercase letters (the uppercase letters being
presented with full-height style).
It is not proper to encode English, French, ..., text that is eventually 
rendered using small capital glyphs using the characters U+1D00  LATIN 
LETTER SMALL CAPITAL A and friends. One should use the regular 
letters, either uppercase (e.g. in acronyms and such) or lowercase (e.g. 
at the beginning of chapter) as appropriate. One way to choose between 
uppercase and lowercase is to consider what happens if only the 
uppercase and lowercase glyphs are available. The fact that those 
character occurrences are intended to be rendered using small cap glyphs 
does not change their identity.

Conversely, the identity of the LATIN LETTER SMALL CAPITAL characters is 
closely connected to IPA, UPA and, more generally, phonetic texts.

This is what TUS 4.0 is saying page 171, under Typographic variants.

Eric.





Re: Useful Breton links

2004-01-16 Thread Eric Muller

I'd really like to know more about Breton, [...] it is not supported by public schools

Breton is taught in public schools in France, including in bilingual 
programs in elementary schools (about 50 schools in 2002). Look for Div 
Yezh.

Kenavo,
Eric.




Re: Useful Breton links

2004-01-16 Thread Eric Muller


Michael Everson wrote:

And Skol Diwan. 
These are indeed schools that teach Breton and other subjects in Breton, 
but they are not public schools, in the sense of being run by the state.

There are true public schools that teach in Breton, and Div Yezh is a 
parent's association that promotes their development; according to their 
site, there are about 50 such schools, enrolling about 3000 students. Of 
course, we are speaking K-12, or about 3 to 18 years old. In those 
schools, the teaching is half in French and half in Breton.

That being said, it is true that France is doing about as little as it 
can to support Breton and other regional languages, in schools and 
elsewhere.

Eric.





Re: OT: Free Fonts

2003-12-04 Thread Eric Muller






John Hudson wrote:
ClearType
is a proprietary renderer that Microsoft don't share with anyone. 
Although that is changing:


http://www.infoworld.com/article/03/12/03/HNmicrosoftip_1.html

Towards the end:

Microsoft expects most licensing
arrangements to
be made one-on-one with interested companies. Formal programs such as
the
ClearType and FAT system arrangements will be relatively rare, Kaefer
said. Microsoft picked those two technologies as its
licensing guinea pigs because it had outside vendors interested in
becoming customers. Agfa Monotype Corp. plans to use ClearType-related
technology in its iType font rendering system, [...]
  
Eric.





Linguistic Diversity and National Unity: Language Ecology in Thailand

2003-11-17 Thread Eric Muller
I just finished reading Linguistic Diversity and National Unity: 
Language Ecology in Thailand  by William Smalley, University of Chicago 
Press, ISBN 0-226-76288/9, and I found it very interesting. However, I 
have no reference to judge it against.  Can anybody comment on it? Any 
significant change since it was published 10 years ago?

Thanks,
Eric.




Re: Hacek - Typing from a keyboard... Help!!!!

2003-10-29 Thread Eric Muller


Rick McGowan wrote:

Caron [...] is *NOT* in current use at all in  English. 

It is widely used in the typography community, for better or for worse.

Eric.





Re: Unicode and Script Encoding Initiative in San Jose Mercury News

2003-10-25 Thread Eric Muller


Doug Ewell wrote:

[...] about You see, boys and girls, computers think only in numbers 
-- in a Silicon Valley paper,


[...] Should we tell them about real quotes?
real quotes are not just for Web publication; they are also for email. 
Throw in real dashes, of the kind  en or em  you prefer

Eric.
8-)




Re: Some questions about fractions

2003-09-30 Thread Eric Muller






Jill Ramonsky wrote:
I'm
wondering, exactly how equivalent are the following sequences:
  
  
U+00BC (vulgar fraction one quarter)
  
U+215F U+0034 (fraction numerator one; digit four)
  
U+0031 U+2044 U+0034 (digit one; fraction slash; digit four)
  
  
In particular, should they be rendered with the same glyph?

I would rather phrase the question as "if all characters involved are
supported by a given layout system, should those three sequences
produce the same visible result?" The first change is to account for an
implementation not supporting, e.g., U+215F. The second change is to
allow different glyph organizations, as long as they produce the same
ink.

That being said, I think the answer is yes, assuming non digits around
the last one.
Is it
possible to compose a single glyph for (say) twenty two over seven,
using the fraction slash?

Sure. It is ok to have a single glyph for arbitrarily large fragments
of text, fractions or not. If a glyph displaying a whole word is
available and appropriate for the circumstances, a rendering engine
could use it.
If I were
to write "one quarter" as U+0031 U+2044 U+0034, how should I then write
"one and a quarter"? Is there a "fraction space" which I should use to
separate the "1" from the "1/4"?

Unicode 4.0 Section 6.2 p 159 has the answer:
If the fraction is to be separated from a previous number,
then a space can be used, choosing the appropriate width (normal, thin,
zero width, and so on). For example 1 + THIN SPACE + 3 + FRACTION SLASH
+ 4 is displayed as 1¾.
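
As for how equivalent the three sequences in the original question are at the character level: none of them is canonically equivalent to another, but all three are compatibility equivalent. A quick check with Python's standard `unicodedata` module (a sketch, and it says nothing about what a particular font or layout engine will render):

```python
import unicodedata

candidates = [
    "\u00bc",        # VULGAR FRACTION ONE QUARTER
    "\u215f\u0034",  # FRACTION NUMERATOR ONE, DIGIT FOUR
    "1\u20444",      # DIGIT ONE, FRACTION SLASH, DIGIT FOUR
]

nfc_forms = [unicodedata.normalize("NFC", s) for s in candidates]
nfkc_forms = [unicodedata.normalize("NFKC", s) for s in candidates]

print(len(set(nfc_forms)))   # distinct under canonical equivalence
print(len(set(nfkc_forms)))  # identical under compatibility equivalence
```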

Eric.





Re: Michael Everson in the news

2003-09-25 Thread Eric Muller
See also http://www.technologyreview.com/articles/innovation10903.asp, 
which is apparently about SEI.

Eric.





Re: W3C Objects To Royalties On ISO Country Codes

2003-09-25 Thread Eric Muller
See also http://news.com.com/2100-1032-5079256.html.

Eric.





Re: About that alphabetician...

2003-09-25 Thread Eric Muller






Michael Everson wrote:
An Irish colleague
here said he liked the article but noted that the Times' web directors
don't use Unicode
  
  
  ...

meta http-equiv="charset" content="iso-8859-1"

...


  

There is an alternative point of view, which says that charset declared
in an HTML (or XML) document is no more than an encoding scheme, and
that all characters in those documents are fundamentally Unicode
characters (i.e. they start in life with the full semantic of Unicode,
they don't inherit it on the occasion of character set conversion).
That view is supported by the XML spec itself, and by the infoset
definition. And because we have numeric character entities, using an
iso-8859-1 encoding scheme is not really a limitation: witness this
message, which contains U+10DB  GEORGIAN LETTER MAN and U+092E 
DEVANAGARI LETTER MA.
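
A short Python sketch of that point, using only the standard library: the two characters survive an iso-8859-1 encoding as numeric character references, and come back unchanged when the references are resolved.

```python
import html

text = "\u10db\u092e"  # GEORGIAN LETTER MAN, DEVANAGARI LETTER MA

# Characters outside iso-8859-1 are written as numeric character references.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Decoding the bytes and resolving the references restores the characters.
decoded = html.unescape(encoded.decode("iso-8859-1"))

print(encoded)
print(decoded == text)
```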

Eric.







Re: Questions on Myanmar encoding

2003-09-24 Thread Eric Muller
Thank you very much for your help.


I don't know what you mean third row of Table 10.3.

It is in Unicode 4.0, section 10.3, page 273,  and  you can see it  at: 
http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf#G24999

With current model..
1018 102C 1039 101B 1031 102C 1010 102C 101C 1032 0020 1001 1004 1039 200C
1017 1039 101A 102C 1038 = What you said?
correct. But it used Space(0020), current rules said to use ZWSP.

Ok.

1021 1013 1031 101B 102D 1000 1014 1039 200C 1012 1031 102C 1039 200C 101C
102C 0020 1042 1048 0020 1042 002C 1040 1040 1040 0020 1000 1030 100A 102E=
US$28 2,00 ...? I think help? 1000 1030 100A 102E 1015 102C
Just one character wrong 1031on third place should be 1012. 

my original: 1021 1013 1031...
your correction: 1021 1013 1012 ...
I am a bit confused, and looking more carefully, my new guess is: 1021 
1019 1031... Apparently, that makes the first word sound like american.

And there should
be no space between 18 2,00.
ok

1010 102D 101B 1005 1039 1006 102C 1014 1039 200C 1025 101A 1039 101A 102C
1025 1039 200C 1018 102F 102D 1037 0020 2018 1015 1004 1039 200C 1012 102C
1014 102E 2019 0020 101B 1031 102C 1000 1039 200C =  'PandaNi' for zoo...
I don't know what red Panda mean but flow is correct just one big mistake
there
is no 1025 1039 200C LetterU Killer. The characters after 1021 to 102A can
not use as character of Killer.
It is 1009 1039 200C. Character 1009 NYA have two glyph. What you see on
Unicode is normal form glyph. Another form glyph is similar, you can say
same, with 1025 U
used only to pressed killer and Character, that have subscript, also another
kind of killer.
I think I understand. Also, I corrected 1018, which should be 101E.

Nice to see you and I'm also wish to change some encoding rules but every
body said TOO LATE.
Just to be clear, I am not proposing any modification to the encoding 
model. At best, I can think of clarifications that could help people 
like me, who have limited knowledge of the script.

In another place in your message, you mention that the current model is 
not optimal for sorting. I am not a specialist of sorting, but this is 
not an entirely unusual situation. It is in general not possible to make 
the encoding model such that it is optimal for all processings 
(rendering, sorting, etc.) You may want to check carefully the UCA, to 
see if and how it can handle proper sorting.

Eric.





Questions on Myanmar encoding

2003-09-18 Thread Eric Muller
1. What is encoded by the sequence of characters

U+1004  MYANMAR LETTER NGA
U+1039  MYANMAR SIGN VIRAMA
U+1004  MYANMAR LETTER NGA
is it kinzi + consonant NGA or consonant NGA + subscript consonant NGA? 
Should we add some words to Table 10.3 to clarify that?

2. Does consonant + subscript consonant NGA ever appear? If so, how is 
it rendered? If not, should we remove U+1004 from the third row of Table 
10.3?

3. About Table 10.3: is it true that *in the encoding model* a cluster 
is always made of one element of each row, with row 2 (consonant) 
mandatory and the other rows optional?

4. Is that model realistic, or are there some exceptions, that is real 
life situations that it does not capture? Or cases where the encoding is 
possible, but not intuitive (e.g. two clusters in the encoding instead 
of one)?

5. Is it correct to view the kinzi as a medial form of NGA, which just 
happens to be encoded at the front of the cluster? For what values of 
"correct"?

6. Finally, I have tried to encode various strings I have seen in print 
(or rather as pictures of printed stuff). I would really appreciate if 
somebody could check my encodings. By the way, I found the introduction 
to the Burmese script on that site very interesting. In particular, not 
having to consider encoding made the presentation more accessible (i.e. 
it provides the level of expertise needed to understand the Composite 
Characters subhead in section 10.3).

Thanks,
Eric.


http://www.seasite.niu.edu/Burmese/PictureGallery/headin1.jpg
U+1018 MYANMAR LETTER BHA
U+102C MYANMAR VOWEL SIGN AA
U+1015 MYANMAR LETTER PA
U+1039 MYANMAR SIGN VIRAMA
U+101B MYANMAR LETTER RA
U+1031 MYANMAR VOWEL SIGN E
U+102C MYANMAR VOWEL SIGN AA
U+1010 MYANMAR LETTER TA
U+102C MYANMAR VOWEL SIGN AA
U+101C MYANMAR LETTER LA
U+1032 MYANMAR VOWEL SIGN AI
U+0020 SPACE
U+1001 MYANMAR LETTER KHA
U+1004 MYANMAR LETTER NGA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1017 MYANMAR LETTER BA
U+1039 MYANMAR SIGN VIRAMA
U+101A MYANMAR LETTER YA
U+102C MYANMAR VOWEL SIGN AA
U+1038 MYANMAR SIGN VISARGA
http://www.seasite.niu.edu/Burmese/PictureGallery/headin2.jpg
U+1019 MYANMAR LETTER MA
U+102C MYANMAR VOWEL SIGN AA
U+1010 MYANMAR LETTER TA
U+102D MYANMAR VOWEL SIGN I
U+1000 MYANMAR LETTER KA
U+102C MYANMAR VOWEL SIGN AA
http://www.seasite.niu.edu/Burmese/PictureGallery/headin4.jpg
U+1021 MYANMAR LETTER A
U+1013 MYANMAR LETTER DHA
U+1031 MYANMAR VOWEL SIGN E
U+101B MYANMAR LETTER RA
U+102D MYANMAR VOWEL SIGN I
U+1000 MYANMAR LETTER KA
U+1014 MYANMAR LETTER NA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1012 MYANMAR LETTER DA
U+1031 MYANMAR VOWEL SIGN E
U+102C MYANMAR VOWEL SIGN AA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+101C MYANMAR LETTER LA
U+102C MYANMAR VOWEL SIGN AA
U+0020 SPACE
U+1042 MYANMAR DIGIT TWO
U+1048 MYANMAR DIGIT EIGHT
U+0020 SPACE
U+1042 MYANMAR DIGIT TWO
U+002C COMMA
U+1040 MYANMAR DIGIT ZERO
U+1040 MYANMAR DIGIT ZERO
U+1040 MYANMAR DIGIT ZERO
U+0020 SPACE
U+1000 MYANMAR LETTER KA
U+1030 MYANMAR VOWEL SIGN UU
U+100A MYANMAR LETTER NNYA
U+102E MYANMAR VOWEL SIGN II
http://www.seasite.niu.edu/Burmese/PictureGallery/headin10.jpg
U+1021 MYANMAR LETTER A
U+1039 MYANMAR SIGN VIRAMA
U+101D MYANMAR LETTER WA
U+1014 MYANMAR LETTER NA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+101C MYANMAR LETTER LA
U+102F MYANMAR VOWEL SIGN U
U+102D MYANMAR VOWEL SIGN I
U+1004 MYANMAR LETTER NGA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1038 MYANMAR SIGN VISARGA
U+1017 MYANMAR LETTER BA
U+102E MYANMAR VOWEL SIGN II
U+1007 MYANMAR LETTER JA
U+102C MYANMAR VOWEL SIGN AA
U+101C MYANMAR LETTER LA
U+1039 MYANMAR SIGN VIRAMA
U+101A MYANMAR LETTER YA
U+1039 MYANMAR SIGN VIRAMA
U+101F MYANMAR LETTER HA
U+1031 MYANMAR VOWEL SIGN E
U+102C MYANMAR VOWEL SIGN AA
U+1000 MYANMAR LETTER KA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1014 MYANMAR LETTER NA
U+102F MYANMAR VOWEL SIGN U
U+102D MYANMAR VOWEL SIGN I
U+1004 MYANMAR LETTER NGA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
http://www.seasite.niu.edu/Burmese/PictureGallery/zoo.ht1.jpg
U+1010 MYANMAR LETTER TA
U+102D MYANMAR VOWEL SIGN I
U+101B MYANMAR LETTER RA
U+1005 MYANMAR LETTER CA
U+1039 MYANMAR SIGN VIRAMA
U+1006 MYANMAR LETTER CHA
U+102C MYANMAR VOWEL SIGN AA
U+1014 MYANMAR LETTER NA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1025 MYANMAR LETTER U
U+101A MYANMAR LETTER YA
U+1039 MYANMAR SIGN VIRAMA
U+101A MYANMAR LETTER YA
U+102C MYANMAR VOWEL SIGN AA
U+1025 MYANMAR LETTER U
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1018 MYANMAR LETTER BHA
U+102F MYANMAR VOWEL SIGN U
U+102D MYANMAR VOWEL SIGN I
U+1037 MYANMAR SIGN DOT BELOW
U+0020 SPACE
U+2018 LEFT SINGLE QUOTATION MARK
U+1015 MYANMAR LETTER PA
U+1004 MYANMAR LETTER NGA
U+1039 MYANMAR SIGN VIRAMA
U+200C ZERO WIDTH NON-JOINER
U+1012 MYANMAR LETTER DA
U+102C MYANMAR VOWEL SIGN AA
U+1014 MYANMAR LETTER NA
U+102E MYANMAR VOWEL SIGN II
U+2019 RIGHT SINGLE QUOTATION MARK
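
Listings like the ones above are tedious to check by eye. A small Python sketch that turns a space-separated hex sequence into a string and an annotated list; `describe` is a made-up helper, and the names it prints come from whatever Unicode version the Python build ships, which may differ from the Unicode 4.0 names:

```python
import unicodedata

def describe(hex_sequence):
    """Return (string, list of 'U+XXXX NAME') for a hex code point list."""
    text = "".join(chr(int(cp, 16)) for cp in hex_sequence.split())
    names = [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
             for ch in text]
    return text, names

text, names = describe("1004 1039 1004")
for line in names:
    print(line)
```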

Re: Faulty ligatures in Adobe PhotoShop

2003-08-27 Thread Eric Muller






Doug Ewell wrote:

  Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote:

  
  
The bad part of it is that the ligated characters shown (in the
second and third examples) seem to include a long "s" instead of an
"f"...  ty_06.gif attached for reference.
  

Thanks for the report, I'll forward to the Photoshop guys. By the way,
the font is apparently Adobe Caslon Pro.

  
Substituting an unligated ſi (U+017F + U+0069) for fi (U+0066 + U+0069)
makes no sense at all.  If the current font doesn't contain an fi
ligature (U+FB01), Photoshop should just leave the combination alone.

More likely, the image was created in Illustrator or some such, and the
glyph selected manually by the author. I did not check explicitly, but
I am ready to bet a whole lot that the font does the correct thing.
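
To make the difference between the character sequences concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration, and the sketch says nothing about what Photoshop or the font actually does.

```python
import unicodedata

def describe(text):
    """Return 'U+XXXX NAME' for each character of text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe("\u0066\u0069"))  # f + i
print(describe("\u017F\u0069"))  # long s + i
print(describe("\uFB01"))        # the fi ligature character

# Compatibility normalization folds the ligature to f + i and the
# long s to a plain s, so the two sequences stay distinct text.
fi_folded = unicodedata.normalize("NFKC", "\uFB01")
long_s_folded = unicodedata.normalize("NFKC", "\u017F\u0069")
print(fi_folded, long_s_folded)
```

In other words, a long s followed by i is different text from f followed by i, whatever glyphs end up on screen.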

Eric.





Re: Unicode 4.0 is online at last!

2003-08-14 Thread Eric Muller


Peter Kirk wrote:

And indeed the software being used is produced by a consortium member. 
Perhaps the embarrassment should be more that member's, that their 
software is not Unicode compatible.

The member in question is a company. Companies are not embarrassed nor 
ashamed.


 whilst implementing the full standard, including all scripts... 
Even the ones that would be newly defined in the next version... ;-) 

Actually, we don't need that much: there are very few strings set in 
other scripts, and for those, it would be quite acceptable to do the 
typesetting by hand - in fact, that's precisely what is done today.

Eric.





Re: Does Unicode 3.1 take care of all characters of 'Hong Kong SupplimentaryCharacter Set - 2001' (HKSCS-2001) ?

2003-08-04 Thread Eric Muller


John McConnell wrote:

The mapping of the HKSCS 2001 repertoire to ISO/IEC 10646-2:2001 has

35 mapped to the private use area
1651 mapped to supplementary plane 2
511 mapped to the Extension A block (on the BMP)
2212 mapped to the CJK Ideographic block (also on the BMP)
plus another 278 mapped elsewhere on the BMP
 

35 + 1651 + 511 + 2212 + 278 = 4687.

HKSCS 1999 has 4,702 characters, and HKSCS 2001 adds 116, for a total of 
4,818.  I believe that the 131 unaccounted for in your decomposition are 
in the "mapped elsewhere" pile, which should be 409.
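
The arithmetic can be checked with a few lines of Python. The figures are the ones quoted in this message; the dictionary keys are just labels chosen for this sketch.

```python
# Counts quoted from the mapping breakdown above.
mapped = {
    "private use area": 35,
    "supplementary plane 2": 1651,
    "Extension A (BMP)": 511,
    "CJK Ideographic block (BMP)": 2212,
    "elsewhere on the BMP": 278,
}

accounted = sum(mapped.values())        # characters covered by the breakdown
total = 4702 + 116                      # HKSCS 1999 plus the 2001 additions
unaccounted = total - accounted         # gap between the two figures
corrected_elsewhere = mapped["elsewhere on the BMP"] + unaccounted

print(accounted, total, unaccounted, corrected_elsewhere)
```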

Eric.





Re: From [b-hebrew] Variant forms of vav with holem

2003-07-30 Thread Eric Muller


Mark Davis wrote:

The UTC accepts and considers proposals from other parties (see
http://www.unicode.org/pending/proposals.html for submitting a
proposal for new characters). For complex matters (which this
definitely seems to be, based on the volume of mail!), it is far and
away the best if someone can attend the appropriate UTC meeting to
explain the details of the proposal, with the pros and cons of
different approaches. 

And be aware that not every UTC member will be up to speed at all. 
Personally, I have next to zero knowledge of Hebrew, and I did not read 
the 400 messages on the subject. I am willing to learn, but it is not 
going to happen without a self-contained proposal. I would also 
appreciate it if we could get pointers to background material on Biblical 
Hebrew that is readily available in the US (please make the subject line 
specific and new if you send that to [EMAIL PROTECTED]).

Eric.





Re: U+23D0 VERTICAL LINE EXTENSION

2003-07-24 Thread Eric Muller


Alan Wood wrote:

I think this leaves only one character in the old Symbol font that does not
have a Unicode equivalent:
RADICAL EXTENDER (decimal 96 in the Windows version)
 

When I prepared the proposal for U+23D0 VERTICAL LINE EXTENSION, it 
was indeed to ensure the complete representation of some other character 
set in Unicode. My target was actually the PUA usage defined by Adobe, 
which included what's needed for the Symbol font.

I did not consider perfect round tripping a necessity: it was enough for 
me to allow the conversion of old data to Unicode, and to leave the old 
world behind. Nor do I consider having a perfect handling of symbol 
pieces in a Unicode only world a necessity: exchanging with somebody the 
plain text "...U+23B2 SUMMATION TOP U+23B3 SUMMATION BOTTOM..." 
does not improve one bit our communication over exchanging "...U+2211 
N-ARY SUMMATION...". Nor do I consider symbol pieces a good solution for 
typesetting (*glyphs* for the symbol pieces may be a good thing for that 
problem, but that requires more communication between a layout engine 
and a font than a mapping from characters to glyphs).

For the RADICAL EXTENDER, I could not convince myself that such a 
character was needed; U+23AF HORIZONTAL LINE EXTENSION is a fine 
character to use for that purpose. U+23D0 VERTICAL LINE EXTENSION was 
much easier to justify (i.e. nothing else made sense) and there was the 
model of U+23AF HORIZONTAL LINE EXTENSION to build on.

This represents only my opinion, and explains why I did not propose 
RADICAL EXTENDER. It says nothing about how the UTC would react to such 
a proposal.

Eric.





Re: U+1D29

2003-05-30 Thread Eric Muller


Anto'nio Martins-Tuva'lkin wrote:

I've just downloaded the PDF files with 4.0 additions (U40-*.pdf). One
question: How is one supposed to tell apart the glyphs for U+1D29 and 
U+1D18?... Or one isn't?... 

In the same way that you tell apart the glyphs for U+0050 P LATIN 
CAPITAL LETTER P and U+03A1 Ρ GREEK CAPITAL LETTER RHO?
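
The parallel is visible in the character names themselves. A minimal sketch with Python's standard `unicodedata` module (the helper `names` is only for illustration):

```python
import unicodedata

def names(pair):
    """Return the Unicode character names for a pair of characters."""
    return tuple(unicodedata.name(ch) for ch in pair)

small_caps = names(("\u1D18", "\u1D29"))  # the pair asked about
capitals = names(("\u0050", "\u03A1"))    # the pair used as the analogy

print(small_caps)
print(capitals)
```

In both pairs one character is Latin and the other Greek; the glyphs may well be identical in a given font.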

Eric.






Re: ISO 8859_2 and Windows 1250

2003-03-12 Thread Eric Muller


Otto Stolz wrote:

CP 1250 contains the ISO 8859-1 characters, hence it is not
suited for Slavic languages. 

I suspect that Otto meant to type "CP 1252 contains..."
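
A quick way to see which code page covers what, sketched with Python's built-in `cp1250` and `cp1252` codecs. The helper `can_encode` and the three sample letters are choices made for this illustration.

```python
def can_encode(text, codec):
    """True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The upper half of ISO 8859-1: U+00A0 through U+00FF.
latin1_upper = "".join(chr(cp) for cp in range(0xA0, 0x100))

# A few letters used by Slavic and other Central European languages.
central_european = "\u0142\u0159\u0151"  # l with stroke, r with caron, o with double acute

print(can_encode(latin1_upper, "cp1252"), can_encode(latin1_upper, "cp1250"))
print(can_encode(central_european, "cp1250"), can_encode(central_european, "cp1252"))
```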

Eric.





Re: Handwritten EURO sign

2003-02-07 Thread Eric Muller
The latest issue of Baseline (www.baselinemagazine.com) has an article 
on the Euro. I did not read it, so I don't know if it speaks of 
handwritten forms.

"Sign of the times: the euro currency symbol" by Conor Mangat.

Eric.





Re: 4701

2003-02-01 Thread Eric Muller


Michael Everson wrote:


Happy New Year of the Yáng to everybody! (I can't work out whether 
it's the Year of the Sheep, the Goat, or the Ram.)

Ram.

Eric.






Re: urban legends just won't go away!

2003-01-31 Thread Eric Muller


Barry Caplan wrote:


Who knew in this day and age flipping bits to change case is still publishable (this is from today!)
 

What I find a lot more objectionable is that what this code claims to 
do is not defined (in particular, the domain over which it applies). 
Without such a qualification, we cannot say whether the code is correct or 
not, no matter how fishy it looks. In fact, this example is a perfectly valid 
implementation if the system claims to handle only an appropriate 
subset of the Unicode character set.
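
As an illustration of why the domain matters, here is a sketch in Python. The published code is not reproduced here, so the function below is only an assumed stand-in for the bit-flipping technique, with its domain (the ASCII letters) stated and enforced.

```python
def flip_case_ascii(ch):
    """Toggle the case of one ASCII letter by flipping bit 0x20.

    The stated domain is A-Z and a-z only; anything else is rejected.
    """
    if not ("A" <= ch <= "Z" or "a" <= ch <= "z"):
        raise ValueError("outside the domain of this function")
    return chr(ord(ch) ^ 0x20)

print(flip_case_ascii("a"), flip_case_ascii("Q"))

# Outside that domain the same bit flip gives nonsense: U+00FF
# (y with diaeresis) becomes U+00DF (sharp s), while the real
# uppercase form is U+0178.
blind_flip = chr(ord("\u00FF") ^ 0x20)
real_upper = "\u00FF".upper()
```

With the domain spelled out, the bit flip is correct; without it, there is nothing to judge it against.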

For more information, see http://www.cs.utexas.edu/users/EWD/.

Eric.





Re: Documenting in Tamil Computing

2002-12-17 Thread Eric Muller


I don't understand what you meant by Unicode not being
mature enough to support multilingual emails. 

Maybe the argument is simply that there are not enough email agents that 
can render Tamil properly from Unicode-encoded text, and that email 
rarely has a useful life that justifies pain today.

Eric.





Re: converting devanagari to mangal unicode

2002-12-16 Thread Eric Muller
In order to convert any Devanagari font to be rendered in the same way, 


Maybe Sunil is just asking for a conversion of data, presumably from 
ISCII to Unicode.

Eric.





TDIL information on Indic languages.

2002-09-26 Thread Eric Muller




 This may be of interest for people working with Indic languages.

Eric.


 Original Message 

  Subject: [li18nux:1096] Re: Linux Future Survey
  Date: Thu, 26 Sep 2002 16:40:36 +0530
  From: "Dutta Abhijit" [EMAIL PROTECTED]
  Reply-To: [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]


Hello

We have been working to provide "standard" descriptions of various Indian
languages for developers to use.

The information is provided here:

http://tdil.mit.gov.in/news.htm . See the newsletters for January and April:

1.  http://tdil.mit.gov.in/tdiljan2002.pdf (Sanskrit, Hindi, Marathi,
    Konkani, Sindhi, Nepali)
2.  http://tdil.mit.gov.in/tdil-april-2002.pdf (Gujarati, Malayalam,
    Oriya, Gurmukhi and Telugu)

Any thoughts ?

Regards,
Abhijit









[OT] looking for electronic dictionaries

2002-08-29 Thread Eric Muller

For my personal use, I would like to acquire electronic dictionaries, 
principally for the major European languages, with the following 
characteristics:

- reputable source

- raw datafiles accessible - I appreciate the interfaces that 
dictionary vendors may provide, but I want to be able to write my own 
code to find the data I am looking for

- the wordlist is the principal aspect; I can live without definitions.

- markup about the structure of words, for things like hyphenation, 
etc. (or from which hyphenation can be derived)

- some form of frequency count would be nice

For example, I'd like to compute something like: the average French 
character occupies x bytes in UTF-8, with average defined in sync with 
the frequency count. And I'd like to compute things like spelling 
changes introduced by hyphenation in Dutch.
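
To be concrete about the first computation, this is roughly what I have in mind, sketched in Python. The tiny frequency table is invented purely for illustration; real counts would come from the dictionary data.

```python
def average_utf8_bytes(frequencies):
    """Average UTF-8 bytes per character, weighted by word frequency.

    frequencies maps each word to its occurrence count.
    """
    total_chars = 0
    total_bytes = 0
    for word, count in frequencies.items():
        total_chars += count * len(word)
        total_bytes += count * len(word.encode("utf-8"))
    return total_bytes / total_chars

# Hypothetical counts: "ete" with two acute accents, and "le".
sample = {"\u00E9t\u00E9": 2, "le": 5}
average = average_utf8_bytes(sample)
print(average)
```

The accented word is 3 characters and 5 bytes, the other 2 characters and 2 bytes, so this sample gives 20 bytes over 16 characters.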

Any pointers?

Thanks,
Eric.






Re: New version of TR29:

2002-08-15 Thread Eric Muller

   Your definition of LatinVowel is problematic. Is Y only a vowel in
   French? In a word such as "yeux", it certainly is a consonant. Could
   this lead to problems?
 
  I don't think so, but I wait for the opinion of French speakers.
 
  What I can see is that things like "l'yaourt" [lja'ur] are normal in
  French spelling, and sometimes are to be found also in Italian
  ("l'yoghurt" ['ljogurt]).


y is either a vowel or a semi-consonant. When a semi-consonant, an
initial y does not cause elision, so "le yaourt". Of course, there are
exceptions: "yeuse" (oak), "yèble" (?) and "yeux" (eyes). The usage is
both ways for "yole" (skiff). There are a few words starting with a
vowel y: "y" (there), "ypérite" (mustard gas), "ytterbium" (?),
"yttrium" (?). Finally, there is elision before most proper nouns
starting with Y: Yonne (a river), York, etc.

That being said, here are a few problematic cases for your proposal:

"prud'homme" (a member of an industrial tribunal) is a single word, as
are his relatives "prud'homal" and "prud'homie".

Grevisse ("Le bon usage", the authority on French usage) gives five
verbs which are considered a single word: "entr'aimer (s')",
"entr'apercevoir", "entr'appeler (s')", "entr'avertir (s')",
"entr'égorger (s')"; Le Petit Robert (1988, a well respected dictionary)
gives only the second one.

There is elision before the names of the consonants f, h, l, m, n, r, s,
x: "admissible à l'X" (accepted at X = École Polytechnique), "devant
l'n" (before the n).

"grand'mère" is definitely one word for me, but "grand'rue" and
"grand'chose" are not so clear. All are archaic forms and Le Petit
Robert does not list any of those (modern: "grand-mère", "rue
principale", "grand chose").

Then there is spoken French: "j'suis allé m'promener" for "je suis allé
me promener" (I went for a walk). There are many such cases of elision
before a consonant.

This spoken French is of course very close to many dialects, or even
close languages (e.g. Picard, spoken in the North of France).

Did we mention that one never breaks a line after an apostrophe that
represents elision?

Speaking of French line break problems, there is also the case of the
";", which takes a space before and after: "foo ; bar". Of course, one 
never breaks on the space just after "foo". Same for ":".
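
The two line-break remarks can be sketched as a toy rule in Python. This is not the TR29 or UAX #14 algorithm, only an illustration under a simplifying assumption: spaces are the only candidate break points, so an apostrophe never creates one.

```python
def break_opportunities(text):
    """Return the indices of spaces where a line break is acceptable.

    A space immediately followed by ';' or ':' is excluded, since in
    French typography that space belongs with the punctuation mark.
    """
    allowed = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        following = text[i + 1] if i + 1 < len(text) else ""
        if following in (";", ":"):
            continue
        allowed.append(i)
    return allowed

print(break_opportunities("foo ; bar"))
print(break_opportunities("devant l'n"))
```

In "foo ; bar" only the space after the semicolon qualifies; in "devant l'n" the single space qualifies and the apostrophe does not.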

Eric.









OCR characters

2002-08-15 Thread Eric Muller

In our OCR fonts, we have two glyphs named erase (looks like a black 
square) and grouperase (looks like a long dash). I don't have a copy 
of the OCR standards, but I suspect those are mandated by these 
standards. On the other hand, I can't find traces of those in 
Unicode, so I suspect they have been unified. But with which characters? 
More generally, are there other things like that we should be aware of?

Thanks,
Eric.






Re: Missing character glyph- example

2002-08-01 Thread Eric Muller

John Hudson wrote:

   but it should *not* be encoded as U+ or as any other codepoint. 
 .notdef should be unencoded. 

Almost. OpenType specifies that there is no functional difference 
between a code point that is not mapped and a code point that is 
explicitly mapped to GID 0, so there is never a need to map any code 
point to GID 0. But at the same time, there is no prohibition against 
mapping explicitly a code point to GID 0.
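
In code terms, the equivalence looks like the following Python sketch. A plain dictionary stands in for the 'cmap' table here; the glyph ID 36 and the code points are arbitrary values chosen for illustration.

```python
NOTDEF_GID = 0  # glyph ID 0 is .notdef

def glyph_for(cmap, code_point):
    """Look up a glyph ID; unmapped code points fall back to GID 0."""
    return cmap.get(code_point, NOTDEF_GID)

# One font leaves U+E000 unmapped, the other maps it explicitly to GID 0.
cmap_unmapped = {0x0041: 36}
cmap_explicit = {0x0041: 36, 0xE000: 0}

same_result = glyph_for(cmap_unmapped, 0xE000) == glyph_for(cmap_explicit, 0xE000)
print(same_result)
```

A client of the font cannot tell the two tables apart through this lookup, which is the sense in which the explicit mapping is never needed.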

Eric.






Re: [OpenType] library for identifying equivalent sequences

2002-07-31 Thread Eric Muller

I don't have what you are looking for [canonically equivalent strings], 
but I am curious how you plan to go from that to:

(The underlying issue is that I'm trying to figure out, given some
precomposed glyph in a font, what are all the valid substitutions that
could be applied in the smart-font code.)


E.g., don't you also want the strings that contain a sprinkling of ZWJ, 
ZWNJ, CGJ, SHY and various other things?
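
A small Python sketch of the gap I am pointing at, using the standard `unicodedata` module. The helper `canonically_equivalent` is written for this illustration; it compares NFD forms, which is one way to test canonical equivalence.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00E9"       # e with acute, precomposed
decomposed = "e\u0301"       # e + combining acute
with_cgj = "e\u034F\u0301"   # same, with COMBINING GRAPHEME JOINER inside
with_zwj = "f\u200Di"        # f ZWJ i

print(canonically_equivalent(precomposed, decomposed))
print(canonically_equivalent(precomposed, with_cgj))
print(canonically_equivalent("fi", with_zwj))
```

Only the first pair is canonically equivalent, yet a font would presumably want to treat the others sensibly too.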

Eric.







Re: ZWJ and Latin Ligatures

2002-07-18 Thread Eric Muller

[EMAIL PROTECTED] wrote:

This was one of the basic
design criteria in order to ensure that support for a script could be added
by building a font using tools accessible to people with less than
C-programming skills and without requiring any re-write of software.

Actually, the goal of "easily add shaping for a new orthography" and 
the goal of "do not duplicate in all fonts what really belongs to an 
orthography" are not as incompatible as we paint them. For example, 
there could be a plug-in mechanism (and those plug-ins could be written 
in a special-purpose language) for Uniscribe.

Eric.






Re: [OpenType] Proposal: Ligatures w/ ZWJ in OpenType

2002-07-16 Thread Eric Muller

I just reread the Unicode 1.0 standard on ZWNJ and ZWJ (p. 77), and it 
seems very similar to the 3.2 explanation (although not as 
detailed). Am I correct in thinking that the intents are the same, except 
maybe for Indic scripts, or is there some other difference I did not spot?

Eric.







Re: [OpenType] Proposal: Ligatures w/ ZWJ in OpenType

2002-07-12 Thread Eric Muller



The mechanism proposed by John to handle ZWJ/ZWNJ makes the implicit assumption
that those characters are transformed into glyphs (via the usual 'cmap' mechanism)
and that this is the avenue to transfer the intent of those characters to
the shaping code in the font (i.e. some kind of ligature lookup). I'd like
to revisit that assumption.

The ZWJ/ZWNJ characters are formatting characters. Their function is definitely
different from the function of the "regular" characters (such as "A"): they
are a way to control the rendering of regular characters around them, and
to express that control in plain text. The debate so far shows that there
is no strong objection to that mechanism by itself.

In an environment richer than plain text, there is obviously the possibility
that this control could be expressed by other means than characters. In the
OpenType world, and in particular in the interface between the layout engine
and the shaping code in fonts, we have more than plain text, or rather plain
glyphs; we also have a description of which features should be applied to
which glyphs. So instead of having glyphs that stand for ZWJ/ZWNJ, can we
use these features?

In fact, we already do that every day. For example, an InDesign user can
insert the two characters x and y, and apply a ligature feature (let's say
'dlig') to them. It seems to me that this is just what ZWJ is about. So InDesign
could do the following given the character sequence x ZWJ y: map it to the glyph
sequence cmap(x) cmap(y), with 'dlig' applied on those two glyphs. This 'dlig'
application takes precedence over one via UI, i.e. it happens regardless of
whether the user requested 'dlig' explicitly. The ZWJ character is simply
not mapped to the glyph stream, since the feature application does the job
of ZWJ.

We can handle ZWNJ in the same way: the sequence x ZWNJ y is transformed
to the glyph sequence cmap(x) cmap(y), with 'dlig' not applied on those two
glyphs. This 'dlig' non-application takes precedence over one via UI, i.e.
'dlig' is not applied to these two glyphs regardless of whether the user
requested 'dlig' explicitly.
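
A minimal Python sketch of the two cases above. Everything in it is an assumption made for illustration: the function name, the dictionary standing in for 'cmap', and the choice to record the 'dlig' decision once per pair of adjacent glyphs. It is not how InDesign or any layout engine is implemented.

```python
ZWJ = "\u200D"
ZWNJ = "\u200C"

def to_glyph_run(text, cmap, user_dlig=False):
    """Map characters to glyphs, turning ZWJ/ZWNJ into 'dlig' decisions.

    Returns (glyphs, pair_dlig), where pair_dlig[i] says whether 'dlig'
    is applied to the pair glyphs[i], glyphs[i + 1].
    """
    glyphs = []
    pair_dlig = []
    override = None  # True after ZWJ, False after ZWNJ, None otherwise
    for ch in text:
        if ch == ZWJ:
            override = True
        elif ch == ZWNJ:
            override = False
        else:
            if glyphs:
                # The formatting character wins over the UI setting.
                pair_dlig.append(user_dlig if override is None else override)
            glyphs.append(cmap[ch])
            override = None
    return glyphs, pair_dlig

toy_cmap = {"x": 1, "y": 2}
print(to_glyph_run("x" + ZWJ + "y", toy_cmap, user_dlig=False))
print(to_glyph_run("x" + ZWNJ + "y", toy_cmap, user_dlig=True))
```

Note that no glyph is ever produced for ZWJ or ZWNJ; only the feature decisions carry their intent.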

[Maybe a better way of thinking about the precedence stuff is to think entirely
in markup terms: 
<ligatures-on> ... x ZWNJ y ... </ligatures-on> is transformed
in the glyph stream <dlig> ... cmap(x) </dlig> <dlig> cmap(y)
... </dlig>, i.e. dlig is off on the pair x y; hold your objection that
a feature is applied to a position rather than a range for a minute.]

With this approach, we gain two things. First, not having a "formatting"
glyph for ZWJ is IMHO a huge conceptual win, even bigger than not having
a "formatting" character ZWJ would be. Second, what John's proposal did not
mention (or maybe I missed it) is that it's not just the ligature features
that have to deal with this glyph, it is all the features; compound
that by all the formatting characters, and you will start to understand Paul's
reaction.

It's interesting to note that this approach can be applied to other formatting
characters as well. Either their intent can be achieved by the layout engine
alone, without help of the font, in which case there is no need to show anything
to the code in the font; no glyph and no feature are a consequence of those
characters. Or their intent needs help of the font, and the OpenType way
to ask for this help is to apply (or not) features.

All that takes care of selecting a ligature, but it does not quite take care
of selecting cursive forms. I can see how we could define 'dlig' to do that
(or define a 'zwj' feature that invokes the ligature lookups plus some single
substitution lookup), but I am not sure I am happy with that. In fact, I
am not sure I am happy with that clause in Unicode. 


Eric.

[About the features applied to ranges rather than positions: think about
it and it should be obvious 8-) It does not make sense to apply a ligature
at a position; what makes sense is to apply a ligature on a range. Think about
1-n substitutions; whatever lookups apply to the source glyph should
also apply to all the replacement glyphs - ranges again. I even believe that
this approach is compatible with the current OpenType spec. More details
on demand.] 





Re: Acrobat, Unicode, Advanced usage

2002-07-09 Thread Eric Muller

Greenwood, Timothy wrote:

This question is pertinent to one asked me the other day for which I did not have an 
answer. Is the code set of an original document relevant for PDF - say EUC, SJIS, PDF 
- will the output perform text searches correctly for differing code set inputs?

PDF documents logically contain two streams: one of characters, and one 
of glyphs.

The glyph stream is always present physically, and is used for 
rendering. Depending on the fonts involved, the PDF generator, and all 
sorts of factors, the meaning of the numbers in that glyph stream, and 
the machinery to locate the actual outlines will vary quite a bit.

The character stream can be represented explicitly, in which case I am 
pretty sure it is always a Unicode stream. Alternatively, it can be 
computed from the glyph stream using various mechanisms; I believe that 
all the computations described in the PDF spec generate a Unicode stream.

The choice of explicit vs implicit character representation is up to the 
PDF producer. In all cases, I believe that the producer has the 
responsibility of converting from whatever character standard is used in 
the original document to Unicode. When the producer is Distiller, it may 
not have access to the original character content and be forced to 
create an approximation.



Eric.






What does Z variant mean for Han?

2002-07-09 Thread Eric Muller



In the description of the Han script (section 10.1), the Z axis is described
by:
The actual shape (typeface) attribute (the Z axis) is for differences
of type design (the actual shape used in imaging) of each variant form.
  
Let's take the concrete example of U+5516 and U+555E, with kZVariant entries
pointing at each other in Unihan.txt. 
  
Does Z variant mean that all the glyphs which are acceptable to represent
U+5516 are also acceptable to represent U+555E, and conversely? Of course,
some shapes may be more appropriate in some circumstances, much like a Fraktur
shape may be more appropriate than a Roman shape in some circumstances.
  
Does it go as far as making folding from one character into the other a useful
operation, assuming one is not interested in perfect round-tripping with
other character standards?
  
If a document that contains U+5516 is rendered, and one does a copy/paste
from a rendering of that document, is it acceptable to paste U+555E?
  
Are the answers to those questions different for other pairs of characters
that are Z variants of one another?
  
Thanks,
Eric.
  
  
  
  
  


Re: Hexadecimal characters.

2002-06-20 Thread Eric Muller

For the scripts which have their own digits, are there conventions to 
write hexadecimal numbers with those digits? If I read a Devanagari text 
book, will I see 20A7, or २०?७ (where ? stands for whatever is 
used for A)?

Thanks,
Eric.





Re: Encoding of symbols, and a lock/unlock pre-proposal

2002-05-20 Thread Eric Muller



Markus Scherer wrote:
They had practical
uses when user interfaces and display systems could not handle icons and
arbitrary images, but those times are long over. 
  
I wish this was the case, but most if not all systems insist that graphics
stored in a font be accessed as characters. This puts pressure on encoding
symbols. 
  
Fonts as packages of graphics are unequaled in some respects:

- they support a unique combination of geometric and structural information
(hinting); the latter is vital at low rendering resolutions and gives the
designer a say in what can be dropped and what should be preserved. Yes,
the rendering resolution of a given system continually improves (e.g. printers,
desktop screens), but at the same time we keep inventing new classes of devices
(e.g. PDA, phones), where cost constraints put us back at low resolution.

- they offer a convenient way to put multiple, usually related, images
in one package

- the publishing industry has figured out how to manage them (not that
it's easy or that they got a lot of help...)

  
Of course, they have many limitations (e.g. monochrome, no transparency,
etc). Nevertheless, it is useful to package images for symbols in fonts.
  
The problem is that essentially all systems insist that characters be used
to access the content of a font. I don't know of any system where I can specify
a graphic as "this glyph in this font", on the same terms as I can specify
"this .gif".
  
Eric.
  
  
  


Re: Greek Extended: question: missing glyphs?

2002-05-01 Thread Eric Muller

David J. Perry wrote:

If I were to make a complete OT Greek font, with all the above as well
as the combinations already in Unicode, which would provide better
performance: substitutions or positioning via OT features?

There is a similar thread on the [EMAIL PROTECTED] mailing list. My 
argument is that you cannot expect positioning to be more complete 
because it will form some combinations with apparently less work. 
Regardless of whether you use substitution or positioning, it's the 
testing of the actual combinations, alone and in context, that is the 
bottleneck. The differences between the two methods (which can be 
characterized as combination at the factory and combination at the 
installation site) are second order factors. In the case of CFF 
outlines, which have a notion of subroutine, the size difference is not 
that big. What is going to drive your choice is the ease of creating 
combination glyphs vs. the ease of creating GPOS lookups in your font 
development environment, and how much you are willing to depend on 
Unicode/OpenType support in the target environment (I know some software 
that is more likely to handle substitutions than mark positioning).

Eric.





