Re: Combining Marks and Variation Selectors
That would imply some coordination among variation sequences on different code points, right? E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn, ccc=0) would imply the existence of a variation sequence on 0B48 with the same variation selector, and the same effect. Eric.

On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:
I don't think there is a technical reason for disallowing variation selectors after any starters (ccc=000); the normalization algorithm doesn't care about the general category of characters. Mark

On Sun, Feb 2, 2020 at 10:09 AM Richard Wordingham via Unicode wrote:
On Sun, 2 Feb 2020 07:51:56 -0800 Ken Whistler via Unicode wrote:
> What it comes down to is avoidance of conundrums involving canonical reordering for normalization. The effect of variation selectors is defined in terms of an immediate adjacency. If you allowed variation selectors to be defined for combining marks of ccc!=0, then normalization of sequences could, in principle, move the two apart. That would make implementation of the intended rendering much more difficult.

I can understand that for non-starters. However, a lot of non-spacing combining marks are starters (i.e. ccc=0), so they would not be a problem. A <starter, variation selector> pair is an unbreakable block in canonical equivalence-preserving changes. Is this restriction therefore just a holdover from when canonical equivalence could still be corrected? Richard.
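Eric's 0B48 example can be checked directly; a hedged Python sketch (the stdlib unicodedata module implements the normalization forms): under NFC, <0B47, 0B56> composes to 0B48, so a variation selector placed after 0B56 ends up following 0B48 instead.

```python
import unicodedata

# <0B47, 0B56> is canonically equivalent to U+0B48 ORIYA VOWEL SIGN AI,
# and 0B48 is not a composition exclusion, so NFC composes the pair.
composed = unicodedata.normalize('NFC', '\u0B47\u0B56')

# U+0B56 is a starter (ccc=0), so canonical reordering never moves it.
ccc = unicodedata.combining('\u0B56')

# A variation selector placed after 0B56 is not reordered away, but
# after composition it ends up following 0B48 rather than 0B56:
with_vs = unicodedata.normalize('NFC', '\u0B47\u0B56\uFE00')
```

This is exactly the coordination question above: after normalization, a hypothetical variation sequence on 0B56 would be indistinguishable from one on 0B48.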
Re: Keyboard layouts and CLDR
Indeed. But "Faÿ-lès-Nemours" / "FAŸ-LÈS-NEMOURS". "lès" in French place names means "near", typically followed by another city name or a river name. In the case of "L'Haÿ-les-Roses", it's just that they have a famous rose garden, hence "les". Eric.

On 1/30/2018 12:06 AM, Martin J. Dürst via Unicode wrote:
On 2018/01/30 16:18, Philippe Verdy via Unicode wrote:
> - Adding Y to the list of allowed letters after the dieresis dead key to produce "Ÿ": the most frequent case is L'HAŸE-LÈS-ROSES, the official name of a French municipality when written with full capitalisation. Almost all spell checkers forget to correct capitalized names such as this one.

Wikipedia has this as L'Haÿ-les-Roses (see https://fr.wikipedia.org/wiki/L'Haÿ-les-Roses). It surely would be L'HAŸ-LES-ROSES, and not L'HAŸE-LÈS-ROSES, when capitalized. I of course know of the phenomenon that in French the accents on upper-case letters are sometimes left out, but I haven't heard of the reverse phenomenon yet. Regards, Martin.
0027, 02BC, 2019, or a new character?
https://www.nytimes.com/2018/01/15/world/asia/kazakhstan-alphabet-nursultan-nazarbayev.html Eric.
Re: Database missing/erroneous information
In the .grouped.xml file, if a <char> element does not have an attribute, it inherits it from its containing <group> element. The group containing the digits has IDC="Y" OIDC="N" XIDC="Y", and so that applies to the digits as well. If you don't want to deal with the inheritance mechanism, just use the .flat.xml files, where the <char> elements carry all the attributes. Eric.

On 7/12/2017 6:35 AM, J Decker via Unicode wrote:
I started looking more deeply at the JavaScript specification. Identifiers are defined as starting with characters having the ID_Start attribute and continuing with ID_Continue characters. I grabbed the XML database (ucd.all.grouped.xml), in which I was able to find the IDS and IDC flags (also OIDS/OIDC and XIDS/XIDC, whose meaning I'm not entirely sure of), and I started filtering to find characters that are NOT IDS|IDC. Something simple like the digits 0x30-0x39 are marked with IDS='N' but have no [OX]IDC flags specified. Is a lack of a flag assumed to be N or Y? The documentation of the XML file format at www.unicode.org/reports/tr42/ doesn't specify. In http://www.unicode.org/reports/tr31/ I see 'ID_Continue characters include ID_Start characters, plus characters …'. Most languages do support identifiers like a1, a2, etc. as valid, so certainly numbers should have IDC even though they're not IDS. Are there characters that are IDS without being IDC? There are certainly characters that are IDC without IDS. Some examples:

found char { cp: '0034', na: 'DIGIT FOUR', gc: 'Nd', nt: 'De', nv: '4', bc: 'EN', lb: 'NU', sc: 'Zyyy', scx: 'Zyyy', Alpha: 'N', Hex: 'Y', AHex: 'Y', IDS: 'N', XIDS: 'N', WB: 'NU', SB: 'NU', Cased: 'N', CWCM: 'N', InSC: 'Number' }
(this has IDC notation but not IDS; since it says 'digit' I assume this is a number type, and should not be IDS.)
found char { cp: '0F32', na: 'TIBETAN DIGIT HALF NINE', gc: 'No', nt: 'Nu', nv: '17/2', Alpha: 'N', IDC: 'N', XIDC: 'N', SB: 'XX', InSC: 'Number' }
This might be not IDS but is IDC?
found char { cp: '203F', na: 'UNDERTIE', gc: 'Pc', IDC: 'Y', XIDC: 'Y', Pat_Syn: 'N', WB: 'EX' }
this is sort of IDS but not IDC?
found char { cp: '309B', na: 'KATAKANA-HIRAGANA VOICED SOUND MARK', gc: 'Sk', dt: 'com', dm: '0020 3099', bc: 'ON', lb: 'NS', sc: 'Zyyy', scx: 'Hira Kana', Alpha: 'N', Dia: 'Y', OIDS: 'Y', XIDS: 'N', XIDC: 'N', WB: 'KA', SB: 'XX', NFKC_QC: 'N', NFKD_QC: 'N', XO_NFKC: 'Y', XO_NFKD: 'Y', CI: 'Y', CWKCF: 'Y', NFKC_CF: '0020 3099', vo: 'Tu' }
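The inheritance rule Eric describes can be made concrete; a hedged sketch (the group/char element names follow UAX #42, but the toy fragment and its attribute values are illustrative, not copied from the real file):

```python
import xml.etree.ElementTree as ET

# Toy fragment in the spirit of ucd.all.grouped.xml: a <char> that
# lacks an attribute inherits it from its enclosing <group>.
doc = """<repertoire>
  <group IDC="Y" OIDC="N" XIDC="Y">
    <char cp="0034" IDS="N" XIDS="N"/>
  </group>
</repertoire>"""

def effective(char_el, group_el, attr):
    # An explicit value on the <char> wins; otherwise fall back to the
    # value carried by the containing <group>.
    return char_el.get(attr, group_el.get(attr))

root = ET.fromstring(doc)
group = root.find('group')
char = group.find('char')

idc = effective(char, group, 'IDC')   # inherited from the group
ids = effective(char, group, 'IDS')   # explicit on the char
```

With the .flat.xml files no such fallback is needed, since every attribute is repeated on each <char>.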
Bangla syllables <... 09BF 09BE> and <... 09BF 09C0>
In looking at the wiki{pedia,books,source,tionary} corpus for Bangla, I see a relatively large number of syllables with <... 09BF 09BE> or <... 09BF 09C0>. I checked a couple of sources, and I did not find them listed anywhere as being normally used. Are they in normal use, or are those all typos? I did not find any occurrence in the Assamese corpus. Thanks, Eric. The syllables, with their numbers of occurrences: 54, 93, 171, 238, 79, 75, 157, 125, 118, 58.
how would you state requirements involving sorting?
Suppose you help somebody write requirements for a piece of software and you see an item: "Sorting. Diacritic marks need to be stripped when sorting titles." You know that sorting is a lot more complicated than removing diacritics, and that giving the directive above to a naive developer is going to lead to trouble. You know you want to end up with an implementation involving the UCA with a tailoring based on the locale. How would you suggest rewording the requirement? Thanks, Eric.
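One way to make the point to the requirement's author is to show what naive stripping actually does; a hedged Python sketch (stdlib only, standing in for a real UCA implementation such as ICU):

```python
import unicodedata

def strip_marks(s):
    # The naive reading of the requirement: decompose, drop combining marks.
    return ''.join(ch for ch in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(ch))

# Four distinct French words collapse to a single sort key, so their
# relative order becomes arbitrary. A UCA collation instead keeps the
# accents as a secondary weight, yielding the deterministic order
# cote < coté < côte < côté.
titles = ['côté', 'cote', 'coté', 'côte']
keys = {strip_marks(t) for t in titles}
```

A requirement phrased as "sort titles according to the Unicode Collation Algorithm, tailored for the user's locale (e.g. via ICU)" avoids steering the developer toward this collapsing behavior.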
Re: "textels"
On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
Quote/Cytat - Eric Muller <eric.mul...@efele.net> (pią, 16 wrz 2016, 17:03:54):
> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
> > (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient).
> I'm very interested to know more about those cases.

For our search engine we were unable to use compatibility equivalence "out of the box" for splitting the ligature, because it also converted long s to short s while we wanted to preserve the distinction.

I am interested in the problems with *canonical* equivalence; I thought that you were talking about those before. Compatibility equivalence is a completely different beast. It is, IMHO, too coarse a tool and best forgotten. For any particular task, it typically does too much (e.g. long/short s folding in your case) and too little (not everything you need). There was an attempt at improving the situation by providing a whole bunch of fine-grained, targeted transformations (http://www.unicode.org/reports/tr30/), but that did not pan out. Eric.
Re: "textels"
On 9/16/2016 6:52 AM, Janusz S. Bień wrote: (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient). I'm very interested to know more about those cases. Thanks, Eric.
Emoji Feminism - The New York Times
http://www.nytimes.com/2016/03/13/opinion/sunday/emoji-feminism.html?_r=0
The Chinese Typewriter: The Design and Science of East Asian Information Technology
For those who are in the San Francisco Bay Area: https://library.stanford.edu/eal The Chinese Typewriter: The Design and Science of East Asian Information Technology During the 19th and 20th centuries, groundbreaking information technologies like the telegraph, the typewriter, and the computer changed the world. All of these technologies were designed with the alphabet in mind, however, leaving open the question: what about China, Japan, and Korea? In this exhibition, the history of modern East Asian information technology is explored through artifacts from the personal collection of Professor Thomas S. Mullaney (History) and the Stanford East Asia Library. Opening Reception and Guest Lectures by Jidong Yang (EAL) and Thomas S. Mullaney (History) on Wednesday, January 20 at 5pm. The exhibition is open from January 20, 2016 to September 10, 2016. Location: Lathrop East Asia Library Audience: General Public, Faculty/Staff, Students, Alumni/Friends, Members Sponsor: Stanford University Libraries, History Department, Program in History and Philosophy of Science and Technology, Center for East Asian Studies, Department of East Asian Languages and Cultures
Re: Proposal for German capital letter "ß"
On 12/10/2015 2:45 AM, Frédéric Grosshans wrote:
Le 10/12/2015 05:32, Martin J. Dürst a écrit :
> A similar example is the use of accents on upper-case letters in French in France, where 'officially' upper-case letters are written without accents.

Actually, the official body in charge of this (Académie Française)

They actually mandate "Académie *f*rançaise". And "Imprimerie *n*ationale" (for Philippe; even if imprimerienationale.fr has forgotten that).

has always recommended upper-case letters with accents, but the school teachers teach the other way, and accents on capital letters were technically challenging (in printing, typewriters and keyboards).

Thanks to gallica.fr and archive.org, it is easy to see what actually happened until the middle of the 20th century. What I have seen is that in both cold and hot metal, until the end of the 19th century, one only and always sees É È Ê Ë Ç Œ Æ; on small caps, one can sometimes find À Â Ô Ù. That matches all the descriptions of the "casse parisienne" and "police" (how many "a", "b", "c", etc. in a font) I have seen in typography manuals. Around the beginning of the 20th century, one starts to see books without accented capitals (and unfortunately books with inconsistent use of accented capitals). Eric.
Toki Pona: A Language With a Hundred Words - The Atlantic
http://www.theatlantic.com/technology/archive/2015/07/toki-pona-smallest-language/398363/ Eric.
Re: UDHR in Unicode: 400 translations in text form!
On 6/28/2015 12:30 PM, Ken Shirriff wrote: I don't mean to be critical, but I find the UDHR page is really hard to use. Thanks for the observations. I'll try to find a better organization. Eric.
Re: UDHR in Unicode: 400 translations in text form!
On 6/28/2015 12:20 PM, Philippe Verdy wrote:
Note: The marker icons showing languages in the Leaflet component (over the OSM map) are not working (broken links).

Fixed, I believe.

Also, the locations assigned to some international languages are strange: Esperanto ... Picard ... Standard French

The locations for those come from http://glottolog.org. Unless those locations are obviously wrong, I'd prefer to keep them aligned.

But in fact I would have placed those international languages somewhere in the middle of an ocean, just aligned vertically in a list along a meridian (across the Atlantic or Pacific for example).

A few are already in Antarctica. I'll move Esperanto and Interlingua there.

Some languages do have an ISO 639-3 code. E.g.:
- Tetum, official in Timor-Leste, is currently coded as 010 (mapped to und in ISO 639-3); it should be tet.

In general, identification of the language of the translations is not trivial. I have learned not to trust just the names provided with the translations. For this one, there is another translation, [tet], which most likely is tet/Tetun. [010] looks like a fairly different language, and it is not clear to me that it is Tetun. I'd rather have some informed recommendation before assigning a language to [010]. It does not help that the source site does not seem accessible right now.

- Forro (Saotomense) is a Portuguese-based creole in Sao Tome, currently coded as 007 (mapped to und); it should use cri.

The OHCHR site warns not to confuse Crioulo Santomense with Santomense (a variety and dialect of Portuguese in São Tomé and Príncipe). Again, I'd prefer some informed recommendation.

- Kimbundu should also use kmb and not 009.
- Umbundo (Umbundu) should also use umb and not 011.

According to the Ethnologue, both Kimbundu and Umbundu are used both as language names and as family names. Given that I don't really trust the sources of those names, I'd prefer some informed recommendation. Thanks, Eric.
Re: UDHR in Unicode: 400 translations in text form!
On 6/28/2015 10:24 PM, Leo Broukhis wrote: Ukrainian is in Estonia, Estonian is in the Baltic sea. I took the locations from glottolog.org. The first error is mine, I mistyped a value. The second error comes from Glottolog, I corrected and reported to them. Will appear in the next update. Thanks, Eric.
UDHR in Unicode: 400 translations in text form!
I am pleased to announce that the UDHR in Unicode project (http://unicode.org/udhr) has reached a notable milestone: we now have 400 translations of the Universal Declaration of Human Rights in text form. The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu de Silva and Sascha Brawer. Many thanks to them and to all the contributors. There is still plenty of work: most translations would benefit from a review, and there are 55 translations for which we have PDFs or images, but not yet the text form (look for stage 2 translations). The site has also been revamped a bit, with a more functional map and a more functional table of the translations. The mappings to ISO 639-3 and BCP 47 have been updated to take into account the evolution of those standards. Again, thanks to all the contributors, past, present and future, Eric. PS: I believe I have taken care of all the backlog of contributions and comments. If I missed something, sorry, and please ping me again.
Re: WORD JOINER vs ZWNBSP
On 6/26/2015 3:48 AM, Marcel Schneider wrote:
To do traditional French typography on the PC, or anywhere a justifying no-break space is needed along with the colon, because this punctuation must be placed in the middle between the word it belongs to and the following word.

Actually, it's non-justifying and it's thin. U+202F ‘ ’ NARROW NO-BREAK SPACE is your friend. Eric.
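As a toy illustration of Eric's suggestion (a hedged sketch; the rule and the function name are mine, and real French typography has more cases than a single regex):

```python
import re

NNBSP = '\u202F'  # NARROW NO-BREAK SPACE

def french_spacing(text):
    # Replace the ASCII space before tall punctuation with U+202F,
    # which is thin and prevents a line break before the mark.
    return re.sub(r' ([;:!?])', NNBSP + r'\1', text)

fixed = french_spacing('Attention : voilà !')
```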
Help with African characters, please
Can you help me identify the characters used in the Kulango, Bouna translation of the UDHR? The text is at http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/kou.pdf. Look for article 14. What is the second letter of the word for article (after the N, looks like a Greek nu), and what is the second letter of the first word (after the M, looks similar but different)? What is the letter that looks somewhat like an epsilon (but compare with the epsilon-like letter in articles 13 and 15)? Thanks, Eric.
Re: Another take on the English apostrophe in Unicode
On 6/10/2015 9:37 PM, Philippe Verdy wrote:
The French "pomme de terre" ("potato" in English; vulgar French synonym: "patate") is a single lemma in dictionaries, but is still 3 separate words (only the first one takes the plural mark); it is not considered a "nom composé" (so there are no hyphens).

Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1 Elements of the language, chapter 7 The words, section 3 Formation of new words, article 2, Composition, very first paragraph (179 overall):
---
By composition, language creates new words, either by combining simple words with existing words, or by preceding these simple words with syllables that have no independent existence: Chou-fleur, gendarme, pomme de terre, contredire, désunir, paratonnerre. A word, despite being formed of graphically independent elements, is composed as soon as it calls to mind, not the distinct images of each of the words from which it is formed, but a single image. Thus the composites hôtel de ville, pomme de terre, arc de triomphe each evoke a single image, and not the distinct images of hôtel and of ville, of pomme and of terre, of arc and of triomphe.
---
(hôtel de ville = city hall; pomme = apple, de = of, terre = earth)

Paragraph 181, 3rd remark:
---
Sometimes the composing elements are welded into a simple word: Bonheur, contredire, entracte; sometimes they are connected by a hyphen: chou-fleur, coffre-fort; sometimes they stay graphically independent: Moyen âge, pomme de terre.
---

(“Le Grévisse”, as we affectionately call it, or Le bon usage / French Grammar with remarks on today’s French language, is a must-have for the student of French. It is encyclopedic in its depth, and has tons of examples and counter-examples. Interestingly, the French wikipedia page says “a descriptive grammar of French”, while the English wikipedia page says “a prescriptive grammar”; it’s both!)

I agree that we don’t need a new space coded character.
I was just pointing out that some of the arguments for a new coded character for the apostrophe in “don’t” apply equally well to the spaces in the word “pomme de terre”. Eric.
Re: ucd beta, stable filenames
On 6/5/2015 8:48 AM, Daniel Bünzli wrote:
Hello, would it be possible in the future to publish the latest version of the UCD files without the -X.Y.ZdW suffixes, under a fixed URI like http://www.unicode.org/Public/beta/, and/or simply publish them in the version directory but without the suffixes (like the ucdxml files do)? With the current scheme it is hard for implementers to automate file downloads for testing with the beta.

+1000 Eric.
Re: Another take on the English apostrophe in Unicode
On 6/5/2015 10:29 AM, John D. Burger wrote:
Linguistically, don't and friends pass all the diagnostics that indicate they're single words.

If I am not mistaken, the French pomme de terre also passes the diagnostics. So we need a new space character. Eric.
Re: Tag characters
On 5/21/2015 1:25 PM, Asmus Freytag (t) wrote: On 5/21/2015 8:46 AM, Peter Constable wrote: Would Unicode really want to get into the business of running a UFL service? I suspect both Eric and I may have have been slightly tongue-in-cheek with respect to UFLs... Actually, I was serious. Eric.
Re: Tag characters
On 5/20/2015 7:11 PM, Doug Ewell wrote: In any event, URLs that point to images would be an awful basis for an encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric.
Re: Usage stats?
Would a corpus like Wikipedia or Project Gutenberg be appropriate for your purpose? Both are freely and easily accessible: http://dumps.wikimedia.org/backup-index.html and http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog. Eric. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Séminaire doctoral Chemins des écritures | Gripic
This seminar may be of interest to those in France. http://www.gripic.fr/evenement/seminaire-doctoral-chemins-ecritures Eric.
Re: fonts for U7.0 scripts
How about even having just the glyphs in the Unicode.org charts being in the public domain? Very easy to achieve: 1. Ask the owner of the font how much money he wants to part with his property. 2. Write a check for the corresponding amount. 3. You are now the owner; you can put the font in the public domain. Eric.
Re: Help with Hebrew
Many thanks for all the answers on my Hebrew and Arabic questions.

On 7/6/2014 4:18 AM, Matitiahu Allouche wrote:
The original text is interesting, combining French, Latin and Hebrew.

There is also a fair amount of Greek, and a couple of Arabic words.

Unfortunately, the author and/or the typesetter were not quite proficient in Hebrew, so the Hebrew words in the 3 referenced pages contain quite a few errors.

I think it's a safe assumption that the typesetter was not necessarily fluent in Hebrew.

I am not sure if the digitization should reproduce faithfully the flaws of the original document, or if it is an opportunity to correct the errors (which may not be possible for the first page).

I want both! In my XML source, I do record things like <correction original="mistaque">mistake</correction>, and render that in the EPUBs I produce as "mistake [mistaque]".

1) Eric's representation of the Hebrew words in f274.image seems correct. So the Unicode sequences are Yod (U+05D9) Segol (U+05B6) Dalet (U+05D3) Segol (U+05B6) Alef (U+05D0) and Yod (U+05D9) Segol (U+05B6) Dalet (U+05D3) Segol (U+05B6) He (U+05D4). However, the Hebrew words are suspect: a. The first one (Yod Dalet Alef) is not a stem known in Hebrew. It could be a deformation of the stem Yod Resh Alef, whose meaning is to fear (= the French craindre).

I would not be surprised if the typesetter confused dalet and resh. The good news is that the text I pointed to is one of the many re-editions of the work, and we have a facsimile of the original edition: http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f26.image Here it seems clear that it's a resh in both examples. By the way, the whole sentence reads roughly "In Hebrew, there are words which are different only in that one ends with an aleph, and the other with a he, which are not pronounced, as […] which means fear and […] which means throw away." This follows a discussion that in French, champ and chant are pronounced the same, with the final p and t silent.

b.
Both grammatical forms (with Segol under the rightmost two letters in both words) do not conform to proper conjugation, as far as I know (conjugation of Hebrew verbs is not a matter for the faint of heart).

The original edition seems to show a qamats. Would that be better?

2) The case of f299.image is yet more complicated:

The original edition: http://gallica.bnf.fr/ark:/12148/btv1b8626248g/f53.image

a. If you compare the rightmost letter in the Hebrew word following "mais dans" with the corresponding letter in the Hebrew word following "pour", you can see that they don't look identical. The first one has a rounded top-right corner while the second one has a more square shape. The first letter looks like a Hebrew letter Resh (U+05E8) and the second looks like a Hebrew letter Dalet (U+05D3, and it is the correct one).

The original seems to show Dalet in all three cases. Overall, what I see there is
- dalet, sheva, bet, patah, resh, space, shin, segol, qof, segol, resh
- shin, segol, qof, segol, resh
- dalet, sheva, bet, patah, resh
- dalet, qamats, bet, qamats, resh
The text is telling how the genitive is marked differently in Latin and in Hebrew. In Latin, in verbum falsitatis, it's falsitas that has been transformed into falsitatis to mark the genitive, while in Hebrew, it's […] (the word for verbum) that is modified.

c. When a word starts with Dalet, there should generally be a Dagesh in the Dalet.

That brings an interesting question. If you look at the French in the two editions (1660 and 1810), you will see that they use different orthographies, and that today's orthography (2014) is yet another one. There is no reason this would not happen in the same way for the Hebrew.
So what I am really after is:
- what's on the page
- what was meant to be on the page, when the editions were made (1660, 1810)
- what one would want to put on the page if one were to make a modern edition, with modern orthography throughout

Is it plausible that the dagesh would only be in the last case (modern orthography), since it's clearly absent in both facsimiles?

d. The point on the Shin (rightmost letter of the second word) is a Sin Dot, while it should be a Shin Dot.

None in the original edition, apparently.

The expression was probably quoted from Exodus XXIII, 7, where the vowel under the Bet is a Patah, which is also the way it would be written in modern Hebrew. So the right sequences (after correcting the errors in the original document) are
- Dalet (U+05D3) Dagesh (U+05BC) Sheva (U+05B0) Bet (U+05D1) Patah (U+05B7) Resh (U+05E8) Space Shin (U+05E9) Shin Dot (U+05C1) Qamats (U+05B8) Qof (U+05E7) Segol (U+05B6) Resh (U+05E8)
- Shin (U+05E9) Shin Dot (U+05C1) Qamats (U+05B8) Qof (U+05E7) Segol (U+05B6) Resh (U+05E8)
- Dalet (U+05D3) Dagesh (U+05BC) Sheva (U+05B0) Bet (U+05D1) Patah (U+05B7) Resh (U+05E8)
- Dalet (U+05D3) Dagesh (U+05BC) Qamats (U+05B8) Bet (U+05D1)
Parsers for the UnicodeSet notation?
I would like to work with the exemplarCharacters data in the CLDR. That uses the UnicodeSet notation. Is there somewhere a parser for that notation, that would return me just the list of characters in the set? Something a bit like the UnicodeSet utility at http://unicode.org/cldr/utility/list-unicodeset.jsp, but for use in apps/shell. I suspect that the exemplarCharacters use a restricted form of the UnicodeSet notation (e.g. do not use property values). Is that correct, and if so, what's the subset? Incidentally, I copy/pasted the punctuation exemplar characters for he.xml into the utility, and it reported that the set contains 8,130 code points, including the ascii letters. Somehow, that seems incorrect. What did I do wrong? Thanks, Eric.
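Pending an authoritative answer on the subset, a minimal hand-rolled parser is easy to sketch (hedged: this handles only literals, a-b ranges, and {multi-character} strings, ignoring escapes, properties, and nested sets, so it is an assumption that exemplar data stays within that subset):

```python
def parse_unicodeset(s):
    # Minimal sketch for a restricted UnicodeSet form such as
    # "[a-c x {ch}]": literals, ranges, and braced strings only.
    assert s.startswith('[') and s.endswith(']')
    body, items, i = s[1:-1], set(), 0
    while i < len(body):
        c = body[i]
        if c == ' ':
            i += 1
        elif c == '{':                       # multi-character string
            j = body.index('}', i)
            items.add(body[i + 1:j])
            i = j + 1
        elif i + 2 < len(body) and body[i + 1] == '-':
            for cp in range(ord(c), ord(body[i + 2]) + 1):
                items.add(chr(cp))           # a-b range, inclusive
            i += 3
        else:
            items.add(c)                     # single literal
            i += 1
    return items

result = parse_unicodeset('[a-c x {ch}]')
```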
Help with Arabic
I am working on the digitization of a text that includes Arabic; could somebody please tell me what is the Unicode representation of the (short) fragments on these two pages? http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f33.image http://gallica.bnf.fr/ark:/12148/bpt6k6439352j/f474.image Thanks, Eric.
Time to learn French!
http://www.forbes.com/sites/pascalemmanuelgobry/2014/03/21/want-to-know-the-language-of-the-future-the-data-suggests-it-could-be-french/ http://www.france24.com/en/20140326-will-french-be-world-most-spoken-language-2050/ http://www.boston.com/bostonglobe/ideas/brainiac/2014/03/the_language_of_1.html Et cetera. Eric.
Re: Editing Sinhala and Similar Scripts
On 3/19/2014 7:57 AM, Peter Constable wrote:
It is nonsensical to talk about erasing a _keystroke_.

Undo: revert the effect of a keystroke. The concept is meaningful. Eric.
Transforming BidiTest.txt to the format of BidiCharacterTest.txt
Does anybody have a program that transforms the UCD file BidiTest.txt to the format of BidiCharacterTest.txt, and that they are willing to share? Thanks, Eric.
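In case it helps anyone with the same need, here is a hedged sketch of the substitution step such a transform requires (the representative code points and the helper function are my own illustrative choices; a full converter must also expand BidiTest.txt's bitset of paragraph directions and its @Levels/@Reorder lines):

```python
# Map each Bidi_Class name used in BidiTest.txt to one representative
# code point (illustrative choices; any character of that class works).
REP = {'L': 0x0041, 'R': 0x05D0, 'AL': 0x0627, 'EN': 0x0030,
       'ON': 0x0021, 'WS': 0x0020, 'B': 0x2029, 'S': 0x0009}

def to_character_test(classes, direction, embed_level, levels, order):
    # Emit one BidiCharacterTest.txt-style line:
    # codepoints;direction;paragraph level;levels;reorder
    cps = ' '.join('%04X' % REP[c] for c in classes)
    return ';'.join([cps, str(direction), str(embed_level), levels, order])

line = to_character_test(['L', 'R'], 0, 0, '0 1', '0 1')
```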
Representation of neutral tone in pinyin and bopomofo
Is it correct that:
- in bopomofo, the neutral (or light) tone is represented by U+02D9 ˙ DOT ABOVE, and in the text representation, that character follows the bopomofo characters of the syllable (just like all the other characters for tones);
- in pinyin, the neutral tone is typically not marked, but it may be marked, and when that's the case, U+02D9 ˙ DOT ABOVE is used?

When U+02D9 is used in pinyin, where is it in the character sequence? Before the syllable to which it applies (where it is displayed) or after (as in bopomofo)?

When U+02D9 is used in bopomofo, it needs to be displayed before the syllable. Is the display position simply before the nearest preceding character from the set {U+3105 ㄅ BOPOMOFO LETTER B ... U+3119 ㄙ BOPOMOFO LETTER S, U+31A0 ㆠ BOPOMOFO LETTER BU ... U+31A3 ㆣ BOPOMOFO LETTER GU}? Thanks, Eric.
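The bopomofo display rule asked about above can be sketched directly (hedged: this simply implements the rule as stated in the question, moving the dot before the nearest preceding initial, and is not a claim about what renderers actually do):

```python
DOT = '\u02D9'  # DOT ABOVE, the neutral-tone mark

# The initials named in the question: U+3105..U+3119 and U+31A0..U+31A3.
INITIALS = ({chr(c) for c in range(0x3105, 0x311A)}
            | {chr(c) for c in range(0x31A0, 0x31A4)})

def display_order(text):
    # Move each dot from its logical position (after the syllable) to
    # just before the nearest preceding initial.
    out = list(text)
    i = 0
    while i < len(out):
        if out[i] == DOT:
            j = i - 1
            while j >= 0 and out[j] not in INITIALS:
                j -= 1
            if j >= 0:
                del out[i]
                out.insert(j, DOT)
        i += 1
    return ''.join(out)

# ㄇㄚ (ma) with a trailing neutral-tone dot renders dot-first:
displayed = display_order('\u3107\u311A\u02D9')
```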
Re: Can the combining diacritical marks combine with any base character?
On 2/11/2013 12:49 AM, Richard Wordingham wrote:
The problem sequence is U+003E GREATER-THAN SIGN, U+0338 COMBINING LONG SOLIDUS OVERLAY, which is canonically equivalent to U+226F NOT GREATER-THAN. Which demonstrates: NFC applied to the serialization of an XML infoset is not the same as NFC applied to the text nodes and attributes of that infoset. The short answer is that XML shall not do canonical equivalence, at least not on data; so doing would corrupt some of the CLDR definitions.

That case is different: it's whether a use of text strings (CLDR in this case) can be indifferent to normalization. There are other cases, e.g. the regular expressions used to validate some of Unihan's properties, which should not be normalized, and which assume that the data to be validated is in NFD. Eric.
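Richard's problem sequence is easy to demonstrate; a hedged Python sketch:

```python
import unicodedata

# U+003E '>' followed by U+0338 COMBINING LONG SOLIDUS OVERLAY is
# canonically equivalent to U+226F NOT GREATER-THAN:
composed = unicodedata.normalize('NFC', '\u003E\u0338')

# So normalizing a *serialized* XML document can eat markup: here the
# overlay opens the element's content, and NFC destroys the start tag.
doc = '<a>\u0338text</a>'
normalized = unicodedata.normalize('NFC', doc)
```

Normalizing the text nodes of the parsed infoset would leave the markup intact; normalizing the byte serialization does not.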
Re: Case-folding dotted i
On 1/24/2013 2:15 AM, Richard Wordingham wrote: If text is going to be processed, i+dot is wrong for Turkish, as the Unicode casing rules for Turkish will capitalise it to I+dot+dot, which should display with two dots. If you're going to use an explicit dot, I'd have said U+0131, U+0307 would be better, though I still think using an explicit dot is wrong in general. Richard. Six abstract characters (hard dotted, dotless, soft dotted in 2 cases) for four coded characters, something has to break somewhere. With the current practice, there is inherent ambiguity. The current practice is tolerable only in the presence of locale information. In which case the addition of combining dots in case transformation is useless, and in fact harmful as Richard showed. Adding more characters (as in creating the hard dotted form by using the dotless + combining dot) breaks current practice. Eric.
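A hedged sketch of Richard's point (this hand-rolls just the tr uppercasing rule from SpecialCasing.txt rather than using a tailored casing library):

```python
def turkish_upper(s):
    # Turkish tailoring: i maps to U+0130 LATIN CAPITAL LETTER I WITH
    # DOT ABOVE; everything else takes the default mapping.
    return s.replace('i', '\u0130').upper()

# An i written with an explicit combining dot (i + U+0307) comes out
# with *two* dots: U+0130 already carries one, and U+0307 survives.
doubled = turkish_upper('i\u0307')
```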
Re: Too narrowly defined: DIVISION SIGN COLON
On 7/11/2012 9:20 AM, Julian Bradfield wrote:
Unicode is about plain text. TeX is about fine typesetting.

Too narrowly defined: Unicode. I think Unicode is not just for plain text, but rather concerns itself with only the lower layer of /any/ text system. When it's plain text, Unicode has the burden of solving all the problems. When it's a richer system, there is the issue of cooperation between the layers, a situation that Unicode cannot ignore. Eric.
Record-A-Thon is tomorrow – help record 50 languages in a single day
From http://blog.mightyverse.com/2011/06/300-languages-record-a-thon/

On July 30th, 2011 we will meet at the Internet Archive in San Francisco, where volunteers will record the Universal Declaration of Human Rights (UDHR, http://www.un.org/en/documents/udhr/index.shtml) in their native language(s). Mightyverse volunteers will assist recording at several recording stations. Each station will be equipped with a video camera, monitor, lighting, microphone and Mightyverse PhraseFarm teleprompter system to enable the capture of spoken language. These high quality recordings of native speakers will be made available at archive.org (http://archive.org/) under a Creative Commons license.

Mightyverse is excited to support the Long Now Foundation’s (http://longnow.org/) 300 languages project in its July 30th, 2011 record-a-thon (http://rosettaproject.org/record-a-thon/). The goal of the 300 languages project is to record spoken language that has parallel translations in at least 300 languages. Towards that effort, Laura Welcher and her team at The Rosetta Project (http://rosettaproject.org/, an ongoing effort by The Long Now Foundation) have identified texts that already exist in parallel translations. Of those texts, we at Mightyverse were especially excited by the UDHR.

Sign up for UDHR recording: https://spreadsheets.google.com/spreadsheet/viewform?formkey=dEM2cW9wSm4za0VmSHZwTEI2amxhNUE6MQ

Also, it will be a fun day with free-form language recording, some speakers at the beginning of the day and at lunch, and there'll be food and prizes for people who record. Eric.
Re: OpenType update for Unicode 5.2/6.0?
I entirely second Peter's description. Let’s keep this in perspective: consider just how much progress there has been in the last ten years. IMHO, we can all be grateful to Microsoft in that area. I don't believe any other company or group has been as instrumental in bringing real solutions to wide audiences. Eric.
Re: Derived age regexp
On 10/15/2010 3:19 PM, Tim Greenwood wrote: Is there any regular expression - in perl, or elsewhere, that enables searching on the derived age? I want to find all characters in a file added since Unicode 4.1. I could write it all by processing against the derived age file, but it would be nice if it is ready to go. XQuery on the XML representation of the UCD is your friend. E.g. --- declare namespace u = "http://www.unicode.org/ns/2003/ucd/1.0"; for $c in doc('ucd.all.flat.xml')//u:ucd/u:repertoire/u:char[@age = '4.1'] return concat($c/@cp, ' ', $c/@age, ' ', $c/@na, '&#xa;') --- Eric.
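If XQuery is not at hand, the same question can be answered directly from the derived age file itself. A minimal sketch in Python, assuming the standard DerivedAge.txt format; `parse_derived_age` and `chars_newer_than` are hypothetical helper names, not part of any library:

```python
def parse_derived_age(lines):
    """Parse DerivedAge.txt-style lines into (start, end, version) tuples."""
    ranges = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        cps, version = [f.strip() for f in line.split(";")]
        start, _, end = cps.partition("..")
        ranges.append((int(start, 16),
                       int(end or start, 16),
                       tuple(int(p) for p in version.split("."))))
    return ranges

def chars_newer_than(text, ranges, version=(4, 1)):
    """Return the characters of text whose age is strictly after version."""
    def age(cp):
        for start, end, v in ranges:
            if start <= cp <= end:
                return v
        return None
    return [c for c in text if (a := age(ord(c))) and a > version]

# Two sample lines in the DerivedAge.txt format (the real file lives under
# the UCD at unicode.org):
sample = [
    "0041..005A    ; 1.1 #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0242          ; 5.0 #       LATIN SMALL LETTER GLOTTAL STOP",
]
ranges = parse_derived_age(sample)
print(chars_newer_than("A\u0242", ranges))  # only U+0242 is newer than 4.1
```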
Re: Bengali Script
On 7/8/2010 5:09 PM, Tulasi wrote: Ok, I am correcting - Bangladeshi to Bengali. The Government of West Bengal / Society for Natural Language Technology Research (a member of the Consortium) has a very strong preference for the term Bangla rather than Bengali. Eric.
Re: Titlecasing iota subscript
See also the FAQ, http://www.unicode.org/faq/greek.html#6 Eric.
NYT article: Using a New Language in Africa to Save Dying Ones
http://www.nytimes.com/2004/11/12/international/africa/12africa.html?ex=1101365144&ei=1&en=b4b60fe9706acc9b Eric.
Re: [africa] Unicode IDNs
Works for me by clicking on the link in Chris's message. Mozilla 1.7 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616, running on some configuration of XP SP2. The navigation bar shows "http://www..net". Eric.
Re: Looking for a C library that converts UTF-8 strings from their decomposed to pre-composed form
Deborah Goldsmith wrote: It's worth pointing out that there is no such thing as precomposed Unicode. Normalization form C (NFC) could be called "as precomposed as possible". There are some sequences of Unicode characters that can only be expressed using combining marks. As well as single (precomposed) characters whose NFC form is a sequence of more than one character. So NFC is not even as precomposed as possible. Eric.
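The second point is easy to check with Python's unicodedata (just an illustration; the behaviour follows from the UCD composition exclusions):

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA has a canonical decomposition to
# U+0915 + U+093C, and is a composition exclusion, so even its NFC
# form is the two-character sequence, not the precomposed character:
print([hex(ord(c)) for c in unicodedata.normalize("NFC", "\u0958")])

# By contrast, e + COMBINING ACUTE ACCENT does compose under NFC:
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```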
Re: Looking for the UDHR in Thai
Ed, thanks for the pointers. I'll be in touch with you off-list. In the meantime, I have Hindi, Sanskrit, Magahi and Bhojpuri versions at http://www.rawbw.com/~emuller/unicode/index.html. Eric.
Looking for the UDHR in Thai
I am using the various versions of the Universal Declaration of Human Rights at http://www.unhchr.ch/udhr/index.htm as test material. Unfortunately, the Thai version is an image, and the resolution is not good enough for me to even attempt to retype the document. Can somebody point me to either better images, or even better to a text version (any encoding, with or without markup)? The only thing I have found so far is http://bkk2.loxinfo.co.th/~aithnd/tudhr.html, but this does not seem to be the complete text: the preamble is missing, and the text is much, much smaller than in any of the other languages. Thanks, Eric.
Re: CYRILLIC CAPITAL/SMALL LETTER PE WITH DESCENDER
I have found a couple of books on Abkhaz, and I have scanned some pages. They are at http://www.unicode.org/~emuller/abkhaz. Page 88 of the second book is a reproduction of a third book, which seems very interesting, but that I have not been able to locate. Eric.
CYRILLIC CAPITAL/SMALL LETTER PE WITH DESCENDER
It seems that Abkhaz, written in Cyrillic, uses a PE WITH DESCENDER, but I can't find this case pair in Unicode. Am I missing something, or do we need to encode those? Evidence: - Daniels and Bright, p717, table 60.15, right column, 6th entry. - Universal Declaration of Human Rights, at http://www.unhchr.ch/udhr/lang/abk.htm By the way, the UDHR document uses its own character set. I have attached a tentative TEC map (for SIL's TECkit tool) to convert to Unicode. The PE with descender characters are converted to U+FFFD followed by 1 for the uppercase and 2 for the lowercase. Thanks, Eric. EncodingNameAbkhaz DescriptiveName Abkhaz Version 1.0 Contact mailto:[EMAIL PROTECTED] pass (Unicode) 0x32 0x2116 0x25 0x40f 0x35 0x45f 0x29 0x4ac 0x30 0x4ad 0x2a 0xfffd 0x31 0x38 0xfffd 0x32 0x2b 0x4be 0x3d 0x4bf 0x3a 0x49a 0x36 0x49b 0x3f 0x4b4 0x37 0x4b5 0x401 0x40e 0x451 0x4e1 0x419 0x49e 0x439 0x49f 0x429 0x4b2 0x449 0x4b3 0x42a 0x4d8 0x44a 0x4d9 0x42d 0x4bc 0x44d 0x4bd 0x42e 0x4a8 0x44e 0x4a9 0x42f 0x4f6 0x44f 0x4f7 0x2116 0x4b6 0x33 0x4b7 0x402 0x30 0x403 0x31 0x404 0x32 0x405 0x33 0x406 0x34 0x407 0x35 0x408 0x36 0x409 0x37 0x40a 0x38 0x40b 0x39
Re: CYRILLIC CAPITAL/SMALL LETTER PE WITH DESCENDER
Michael Everson wrote: At 08:12 -0700 2004-09-28, Eric Muller wrote: It seems that Abkhaz, written in Cyrillic, uses a PE WITH DESCENDER, but I can't find this case pair in Unicode. Am I missing something, or do we need to encode those? U+04A6, U+04A7 are used in Abkhaz for that sound, I believe. Isn't it problematic to have the distinction between (MIDDLE) HOOK and DESCENDER for GHE (494/4F6), KA (4C3/49A) and arguably EN (4C7/4C9) but not for PE? That being said, I am not trying to beat the master of disunification 8-) If we agree that 4A6/7 is it, then we need at least an annotation that it can be rendered with a descender instead of a hook, or maybe go all the way to a change of the representative glyph to use a descender, since that is the form used both in Daniels and Bright and in the Abkhaz font. Thanks, Eric.
Re: valid characters in user names- esp. compatibility characters
Tex Texin wrote: However, I am curious as to whether some Users might read/write their names using compatibility characters (esp. in ideographic markets) and object to the characters being normalized through nfkc. There is a further problem there, because the CJK compatibility characters have a *canonical* decomposition. The UTC is working on some scheme for Ideographic Variation Sequences, and the intent is to use that to solve the canonical equivalence problem. Eric.
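To illustrate why this is a problem, a small check with Python's unicodedata (U+F900 is just one example from the CJK Compatibility Ideographs block):

```python
import unicodedata

# CJK compatibility ideographs carry *canonical* (singleton) decompositions,
# so every normalization form folds them away. E.g. U+F900 -> U+8C48:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, "\uF900") == "\u8C48")  # all True
```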
Re: Writing Tatar using the Latin script; new characters to encode?
Mark E. Shoulson wrote: Unicode exists to support what people use. Do people use Latin script for Tatar? Evidence indicates that they do. Should Unicode support it, then? Certainly. Does Unicode support it? Yes, Unicode supports the Latin script, with gobs of extensions. So what's the problem? Latin n with descender, which is not encoded but needed according to http://www.eki.ee/letter/chardata.cgi?lang=tt+Tatar&script=latin. Eric.
Re: Proposal to encode dominoes and other game symbols
Philippe Verdy wrote: A suggestion for playing cards: why not include the Tarots? I mean in French the 4 Cavaliers figures, the 18 Atouts, and the Excuse (which is not exactly a Joker); sorry, I don't have their English names. Make that 21 atouts (labeled 1 through 21), for a total of 78 cards. The cavalier is between the jack and the queen. Very popular game in high school and college in my days. Eric.
Question on CLDR number patterns
The decimal pattern for Arabic/Kuwait contains U+0660 ARABIC-INDIC DIGIT ZERO, apparently for the MinimumInteger part (using the Java DecimalFormat terminology), presumably to select the set of Arabic digits. However, this mechanism does not seem to be part of the Java patterns, so I suspect it was added by CLDR. But the best description I have been able to find is in UTR #35: The numbers element supplies information for formatting and parsing numbers and currencies. It has three sub-elements: symbols, numbers, and currencies. The data is based on the Java/ICU format. The currency IDs are from [ISO4217]. For more information, including the pattern structure, see [JavaNumbers]. (The last pointer goes to the J2SE 1.4.1 documentation, and Sun says "Products listed on this page have completed the Sun End of Life process. ", btw) So where can I find the documentation on the use of something other than U+0030 0 DIGIT ZERO in a CLDR number pattern? Thanks, Eric.
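For comparison, the Java mechanism works as follows: DecimalFormatSymbols.setZeroDigit shifts the entire digit set by the offset of the new zero from U+0030. A rough sketch of that digit substitution in Python; `localize_digits` is a hypothetical helper, not CLDR's or ICU's actual algorithm:

```python
def localize_digits(formatted, zero_digit="\u0660"):
    """Shift ASCII digits to the decimal digit set starting at zero_digit,
    mimicking Java's DecimalFormatSymbols.setZeroDigit."""
    offset = ord(zero_digit) - ord("0")
    return "".join(
        chr(ord(c) + offset) if c.isascii() and c.isdigit() else c
        for c in formatted
    )

# '1,234.56' with ARABIC-INDIC digits; the separators are left alone here,
# whereas a real formatter would also localize them from the symbols element:
print(localize_digits("1,234.56"))
```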
Re: Question on CLDR number patterns
Mark Davis wrote: The decimal format looks like the following: #,##0.###;#,##0.###- I was actually looking at the locales through the ICU explorer, which apparently replaces the localizable characters by those specified in the symbols, hence my confusion. (We should add documentation in the future so that we don't depend on anything from Sun.) Or anybody else, and not just because the links are not stable. BTW, it would probably be better to float your questions on the CLDR mailing list. Will do. That's the old unicore vs. unicode, but it's now cldr vs. unicode. Thanks, Eric.
Re: ISO 15924 French name Gotique: a typo...???
Michael Everson wrote: Collins-Robert Senior Dictionnaire Français-Anglais Anglais-Français: gothique [architecture, style] Gothic; écriture ~ Gothic script That means Fraktur. gotique [ling] Gothic That means Wulfilan. Stet. Le Petit Robert (1987) concurs with your assessment: --- GOTIQUE. Voir GOTHIQUE 3. GOTHIQUE 3. Écriture gothique: écriture à caractères droits, ... Ling. n. m. Le gothique ou (plus souvent) GOTIQUE, langue des Goths, rameau oriental des langues germaniques. --- So does Vendryes in Fossey: the relevant chapter title is "Écriture gotique". Eric.
Writing Tatar using the Latin script; new characters to encode?
According to www.eki.ee, there is currently an effort to convert the writing of Tatar from Cyrillic to Latin. 1. Does somebody have more information about that effort? Eki lists four characters as needed but missing in Unicode (see http://www.eki.ee/letter/chardata.cgi?lang=tt+Tatar&script=latin). 2. The case pair for barred o is encoded (U+019F and U+0275), and it seems that the confusion comes from the less-than-perfect but annotated name of U+019F, and from the usage remark "African". Can we authoritatively tell them that those two characters are the ones they want? Can we add a Tatar usage remark to both? 3. The case pair n with descender is definitely not encoded, and from my memory of the discussion of ghe with descender, we would want to encode them as separate characters (rather than with combining descenders on n). Is anybody working on that proposal? Thanks, Eric. PS: sorry for the double post to unicode and unicore. However, given the current state of [EMAIL PROTECTED], this seems the best course of action.
Re: GB18030 and super font
Raymond Mercier wrote: But that link to proofing tools leads nowhere. Maybe it's not so easy to get the CHS version. http://www.amazon.com/exec/obidos/tg/detail/-/BBZ54P/qid=1082651762/sr=8-1/ref=pd_ka_1/103-8333725-5907026?v=glance&s=software&n=507846 Includes ~140 fonts, mostly for CJK, Arabic, Hebrew but other scripts as well. Includes "Simsun (Founder Extended)" aka "-", with 65,531 glyphs! Eric.
Re: GB18030 and super font
Raymond Mercier wrote: Mark Shoulson writes their Super Font is bundled with Microsoft Office XP, and even Microsoft's prices haven't gotten that high! From Microsoft, http://www.microsoft.com/globaldev/DrIntl/columns/015/default.mspx : "A font that contains Simplified Chinese glyphs from both CJK Extension A and B sets is "SimSun (Founder Extended)" (SurSong.ttf in the system)". The following ideographic characters are apparently mapped: all of the URO, all 12 unified ideographs from the CJK Compatibility Ideographs block, all of Ext A, 36,862 of Ext B (out of 42,711), some of the compatibility ideographs of the CJK Compatibility Ideographs block, none of the CJK Compatibility Supplement. Eric.
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Kenneth Whistler wrote: Uh, no. <ZWNBSP, SPACE, ZWNBSP> is equivalent to NBSP. I suspect that the equivalence holds only in some respects. In particular, NBSP has a bidi category of CS, which means that A 0 <NBSP> 7 B (in bidi notation) displays as B 0 7 A, while A 0 <ZWNBSP, SPACE, ZWNBSP> 7 B displays as B 7 0 A. Eric.
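The premise about the bidi categories is easy to verify with Python's unicodedata:

```python
import unicodedata

# NBSP is CS (Common Number Separator); ZWNBSP and SPACE are not,
# which is why the two sequences can order differently in bidi text:
print(unicodedata.bidirectional("\u00A0"))  # 'CS' (NO-BREAK SPACE)
print(unicodedata.bidirectional("\uFEFF"))  # 'BN' (ZERO WIDTH NO-BREAK SPACE)
print(unicodedata.bidirectional("\u0020"))  # 'WS' (SPACE)
```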
Re: Converting between Shift-JIS and Unicode
Rick Cameron wrote: Could you please point me to information on the relationship between JIS X 0208-1990 (as represented by the kJis0 field in Unihan.txt) and Shift-JIS? The JIS X 0208 and JIS X 0213 include a description (in Japanese, but with pictures) of the relationship. They are available on line: go to http://www.jisc.go.jp/; click on the blue button labeled JIS in the Search section; enter JIS X 0213 in the third input box; click on the attached button; select JIS X 0213 in the resulting list. The first PDF is about the changes in :2004, the others are actually :2000. Similarly for 0208. You can also consult Ken Lunde's excellent book: CJKV Information Processing, ISBN 1-56592-224-7 (http://www.oreilly.com/catalog/cjkvinfo/index.html). If you intend to do almost any work on CJK stuff, you need those two sources, and some more. Eric.
Re: LATIN SMALL LIGATURE CT
Peter Constable wrote: Adobe included this and other ligatures in their use of the PUA for their own legacy reasons; More specifically, to allow applications which do not have OpenType layout to display those ligatures. it has otherwise never been necessary for them to do so in their Pro fonts, Indeed, InDesign does not use those code points. and I believe they are moving away from that practice. Not until the majority of applications support OpenType layout, and, more importantly, the majority of documents have been migrated to not use those PUA characters. In other words, probably never for existing fonts. Eric.
Re: collation of small capitals
Philippe Verdy wrote: The most common use I have seen of small capitals is as a font style, where they were used to represent lowercase letters (the uppercase letters being presented with full-height style). It is not proper to encode English, French, ..., text that is eventually rendered using small capital glyphs using the characters U+1D00 LATIN LETTER SMALL CAPITAL A and friends. One should use the regular letters, either uppercase (e.g. in acronyms and such) or lowercase (e.g. at the beginning of a chapter) as appropriate. One way to choose between uppercase and lowercase is to consider what happens if only the uppercase and lowercase glyphs are available. The fact that those character occurrences are intended to be rendered using small cap glyphs does not change their identity. Conversely, the identity of the LATIN LETTER SMALL CAPITAL characters is closely connected to IPA, UPA and, more generally, phonetic texts. This is what TUS 4.0 is saying on page 171, under Typographic variants. Eric.
Re: Useful Breton links
I'd really like to know more about Breton, [...] it is not supported by public schools Breton is taught in public schools in France, including in bilingual programs in elementary schools (about 50 schools in 2002). Look for Div Yezh. Kenavo, Eric.
Re: Useful Breton links
Michael Everson wrote: And Skol Diwan. These are indeed schools that teach Breton and other subjects in Breton, but they are not public schools, in the sense of being run by the state. There are true public schools that teach in Breton, and Div Yezh is a parents' association that promotes their development; according to their site, there are about 50 such schools, enrolling about 3000 students. Of course, we are speaking K-12, or about 3 to 18 years old. In those schools, the teaching is half in French and half in Breton. That being said, it is true that France is doing about as little as it can to support Breton and other regional languages, in schools and elsewhere. Eric.
Re: OT: Free Fonts
John Hudson wrote: ClearType is a proprietary renderer that Microsoft don't share with anyone. Although that is changing: http://www.infoworld.com/article/03/12/03/HNmicrosoftip_1.html Towards the end: Microsoft expects most licensing arrangements to be made one-on-one with interested companies. Formal programs such as the ClearType and FAT system arrangements will be relatively rare, Kaefer said. Microsoft picked those two technologies as its licensing guinea pigs because it had outside vendors interested in becoming customers. Agfa Monotype Corp. plans to use ClearType-related technology in its iType font rendering system, [...] Eric.
Linguistic Diversity and National Unity: Language Ecology in Thailand
I just finished reading Linguistic Diversity and National Unity: Language Ecology in Thailand by William Smalley, University of Chicago Press, ISBN 0-226-76288/9, and I found it very interesting. However, I have no reference to judge it against. Can anybody comment on it? Any significant change since it was published 10 years ago? Thanks, Eric.
Re: Hacek - Typing from a keyboard... Help!!!!
Rick McGowan wrote: Caron [...] is *NOT* in current use at all in English. It is widely used in the typography community, for better or for worse. Eric.
Re: Unicode and Script Encoding Initiative in San Jose Mercury News
Doug Ewell wrote: [...] about You see, boys and girls, computers think only in numbers -- in a Silicon Valley paper, [...] Should we tell them about real quotes? real quotes are not just for Web publication; they are also for email. Throw in real dashes, of the kind en or em you prefer Eric. 8-)
Re: Some questions about fractions
Jill Ramonsky wrote: I'm wondering, exactly how equivalent are the following sequences: U+00BC (vulgar fraction one quarter) U+215F U+0034 (fraction numerator one; digit four) U+0031 U+2044 U+0034 (digit one; fraction slash; digit four) In particular, should they be rendered with the same glyph? I would rather phrase the question as "if all characters involved are supported by a given layout system, should those three sequences produce the same visible result?" The first change is to account for an implementation not supporting, e.g., U+215F. The second change is to allow different glyph organizations, as long as they produce the same ink. That being said, I think the answer is yes, assuming no digits around the last one. Is it possible to compose a single glyph for (say) twenty two over seven, using the fraction slash? Sure. It is ok to have a single glyph for arbitrarily large fragments of text, fractions or not. If a glyph displaying a whole word is available and appropriate for the circumstances, a rendering engine could use it. If I were to write "one quarter" as U+0031 U+2044 U+0034, how should I then write "one and a quarter"? Is there a "fraction space" which I should use to separate the "1" from the "1/4"? Unicode 4.0, Section 6.2, p. 159 has the answer: If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + THIN SPACE + 3 + FRACTION SLASH + 4 is displayed as 1 ¾. Eric.
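The relationship between the first and third sequences is recorded in the normalization data: U+00BC carries a compatibility, not canonical, decomposition. A quick check with Python's unicodedata:

```python
import unicodedata

# NFC leaves the vulgar fraction alone; NFKC unfolds it into
# <DIGIT ONE, FRACTION SLASH, DIGIT FOUR>:
print(unicodedata.normalize("NFC", "\u00BC") == "\u00BC")     # True
print(unicodedata.normalize("NFKC", "\u00BC") == "1\u20444")  # True
print(unicodedata.decomposition("\u00BC"))  # '<fraction> 0031 2044 0034'
```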
Re: Michael Everson in the news
See also http://www.technologyreview.com/articles/innovation10903.asp, which is apparently about SEI. Eric.
Re: W3C Objects To Royalties On ISO Country Codes
See also http://news.com.com/2100-1032-5079256.html. Eric.
Re: About that alphabetician...
Michael Everson wrote: An Irish colleague here said he liked the article but noted that the Times' web directors don't use Unicode ... <meta http-equiv="charset" content="iso-8859-1"> ... There is an alternative point of view, which says that the charset declared in an HTML (or XML) document is no more than an encoding scheme, and that all characters in those documents are fundamentally Unicode characters (i.e. they start in life with the full semantics of Unicode; they don't inherit them on the occasion of character set conversion). That view is supported by the XML spec itself, and by the infoset definition. And because we have numeric character references, using an iso-8859-1 encoding scheme is not really a limitation: witness this message, which contains U+10DB GEORGIAN LETTER MAN and U+092E DEVANAGARI LETTER MA. Eric.
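That view is also mechanical to exploit: a serializer can emit numeric character references for anything outside the declared charset. A sketch in Python using the xmlcharrefreplace error handler:

```python
# Any Unicode character survives an iso-8859-1 encoding scheme as a
# numeric character reference:
s = "\u10DB\u092E"  # GEORGIAN LETTER MAN, DEVANAGARI LETTER MA
print(s.encode("iso-8859-1", errors="xmlcharrefreplace"))
# b'&#4315;&#2350;'
```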
Re: Questions on Myanmar encoding
Thank you very much for your help. I don't know what you mean third row of Table 10.3. It is in Unicode 4.0, section 10.3, page 273, and you can see it at: http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf#G24999 With current model.. 1018 102C 1039 101B 1031 102C 1010 102C 101C 1032 0020 1001 1004 1039 200C 1017 1039 101A 102C 1038 = What you said? correct. But it used Space(0020), current rules said to use ZWSP. Ok. 1021 1013 1031 101B 102D 1000 1014 1039 200C 1012 1031 102C 1039 200C 101C 102C 0020 1042 1048 0020 1042 002C 1040 1040 1040 0020 1000 1030 100A 102E= US$28 2,00 ...? I think help? 1000 1030 100A 102E 1015 102C Just one character wrong 1031on third place should be 1012. my original: 1021 1013 1031... your correction: 1021 1013 1012 ... I am a bit confused, and looking more carefully, my new guess is: 1021 1019 1031... Apparently, that makes the first word sound like american. And there should be no space between 18 2,00. ok 1010 102D 101B 1005 1039 1006 102C 1014 1039 200C 1025 101A 1039 101A 102C 1025 1039 200C 1018 102F 102D 1037 0020 2018 1015 1004 1039 200C 1012 102C 1014 102E 2019 0020 101B 1031 102C 1000 1039 200C = 'PandaNi' for zoo... I don't know what red Panda mean but flow is correct just one big mistake there is no 1025 1039 200C LetterU Killer. The characters after 1021 to 102A can not use as character of Killer. It is 1009 1039 200C. Character 1009 NYA have two glyph. What you see on Unicode is normal form glyph. Another form glyph is similar, you can say same, with 1025 U used only to pressed killer and Character, that have subscript, also another kind of killer. I think I understand. Also, I corrected 1018, which should be 101E. Nice to see you and I'm also wish to change some encoding rules but every body said TOO LATE. Just to be clear, I am not proposing any modification to the encoding model. At best, I can think of clarifications that could help people like me, who have limited knowledge of the script. 
In another place in your message, you mention that the current model is not optimal for sorting. I am not a specialist of sorting, but this is not an entirely unusual situation. It is in general not possible to make the encoding model such that it is optimal for all processings (rendering, sorting, etc.) You may want to check carefully the UCA, to see if and how it can handle proper sorting. Eric.
Questions on Myanmar encoding
1. What is encoded by the sequence of characters U+1004 MYANMAR LETTER NGA U+1039 MYANMAR SIGN VIRAMA U+1004 MYANMAR LETTER NGA: is it kinzi + consonant NGA, or consonant NGA + subscript consonant NGA? Should we add some words to Table 10.3 to clarify that? 2. Does consonant + subscript consonant NGA ever appear? If so, how is it rendered? If not, should we remove U+1004 from the third row of Table 10.3? 3. About Table 10.3: is it true that *in the encoding model* a cluster is always made of one element of each row, with row 2 (consonant) mandatory and the other rows optional? 4. Is that model realistic, or are there some exceptions, that is, real life situations that it does not capture? Or cases where the encoding is possible, but not intuitive (e.g. two clusters in the encoding instead of one)? 5. Is it correct to view the kinzi as a medial form of NGA, which just happens to be encoded at the front of the cluster? For what values of correct? 6. Finally, I have tried to encode various strings I have seen in print (or rather as pictures of printed stuff). I would really appreciate it if somebody could check my encodings. By the way, I found the introduction to the Burmese script on that site very interesting. In particular, not having to consider encoding made the presentation more accessible (i.e. it provides the level of expertise needed to understand the Composite Characters subhead in section 10.3). Thanks, Eric. 
http://www.seasite.niu.edu/Burmese/PictureGallery/headin1.jpg U+1018 MYANMAR LETTER BHA U+102C MYANMAR VOWEL SIGN AA U+1015 MYANMAR LETTER PA U+1039 MYANMAR SIGN VIRAMA U+101B MYANMAR LETTER RA U+1031 MYANMAR VOWEL SIGN E U+102C MYANMAR VOWEL SIGN AA U+1010 MYANMAR LETTER TA U+102C MYANMAR VOWEL SIGN AA U+101C MYANMAR LETTER LA U+1032 MYANMAR VOWEL SIGN AI U+0020 SPACE U+1001 MYANMAR LETTER KHA U+1004 MYANMAR LETTER NGA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+1017 MYANMAR LETTER BA U+1039 MYANMAR SIGN VIRAMA U+101A MYANMAR LETTER YA U+102C MYANMAR VOWEL SIGN AA U+1038 MYANMAR SIGN VISARGA http://www.seasite.niu.edu/Burmese/PictureGallery/headin2.jpg U+1019 MYANMAR LETTER MA U+102C MYANMAR VOWEL SIGN AA U+1010 MYANMAR LETTER TA U+102D MYANMAR VOWEL SIGN I U+1000 MYANMAR LETTER KA U+102C MYANMAR VOWEL SIGN AA http://www.seasite.niu.edu/Burmese/PictureGallery/headin4.jpg U+1021 MYANMAR LETTER A U+1013 MYANMAR LETTER DHA U+1031 MYANMAR VOWEL SIGN E U+101B MYANMAR LETTER RA U+102D MYANMAR VOWEL SIGN I U+1000 MYANMAR LETTER KA U+1014 MYANMAR LETTER NA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+1012 MYANMAR LETTER DA U+1031 MYANMAR VOWEL SIGN E U+102C MYANMAR VOWEL SIGN AA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+101C MYANMAR LETTER LA U+102C MYANMAR VOWEL SIGN AA U+0020 SPACE U+1042 MYANMAR DIGIT TWO U+1048 MYANMAR DIGIT EIGHT U+0020 SPACE U+1042 MYANMAR DIGIT TWO U+002C COMMA U+1040 MYANMAR DIGIT ZERO U+1040 MYANMAR DIGIT ZERO U+1040 MYANMAR DIGIT ZERO U+0020 SPACE U+1000 MYANMAR LETTER KA U+1030 MYANMAR VOWEL SIGN UU U+100A MYANMAR LETTER NNYA U+102E MYANMAR VOWEL SIGN II http://www.seasite.niu.edu/Burmese/PictureGallery/headin10.jpg U+1021 MYANMAR LETTER A U+1039 MYANMAR SIGN VIRAMA U+101D MYANMAR LETTER WA U+1014 MYANMAR LETTER NA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+101C MYANMAR LETTER LA U+102F MYANMAR VOWEL SIGN U U+102D MYANMAR VOWEL SIGN I U+1004 MYANMAR LETTER NGA U+1039 MYANMAR SIGN VIRAMA 
U+200C ZERO WIDTH NON-JOINER U+1038 MYANMAR SIGN VISARGA U+1017 MYANMAR LETTER BA U+102E MYANMAR VOWEL SIGN II U+1007 MYANMAR LETTER JA U+102C MYANMAR VOWEL SIGN AA U+101C MYANMAR LETTER LA U+1039 MYANMAR SIGN VIRAMA U+101A MYANMAR LETTER YA U+1039 MYANMAR SIGN VIRAMA U+101F MYANMAR LETTER HA U+1031 MYANMAR VOWEL SIGN E U+102C MYANMAR VOWEL SIGN AA U+1000 MYANMAR LETTER KA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+1014 MYANMAR LETTER NA U+102F MYANMAR VOWEL SIGN U U+102D MYANMAR VOWEL SIGN I U+1004 MYANMAR LETTER NGA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER http://www.seasite.niu.edu/Burmese/PictureGallery/zoo.ht1.jpg U+1010 MYANMAR LETTER TA U+102D MYANMAR VOWEL SIGN I U+101B MYANMAR LETTER RA U+1005 MYANMAR LETTER CA U+1039 MYANMAR SIGN VIRAMA U+1006 MYANMAR LETTER CHA U+102C MYANMAR VOWEL SIGN AA U+1014 MYANMAR LETTER NA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+1025 MYANMAR LETTER U U+101A MYANMAR LETTER YA U+1039 MYANMAR SIGN VIRAMA U+101A MYANMAR LETTER YA U+102C MYANMAR VOWEL SIGN AA U+1025 MYANMAR LETTER U U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+1018 MYANMAR LETTER BHA U+102F MYANMAR VOWEL SIGN U U+102D MYANMAR VOWEL SIGN I U+1037 MYANMAR SIGN DOT BELOW U+0020 SPACE U+2018 LEFT SINGLE QUOTATION MARK U+1015 MYANMAR LETTER PA U+1004 MYANMAR LETTER NGA U+1039 MYANMAR SIGN VIRAMA U+200C ZERO WIDTH NON-JOINER U+1012 MYANMAR LETTER DA U+102C MYANMAR VOWEL SIGN AA U+1014 MYANMAR LETTER NA U+102E MYANMAR VOWEL SIGN II U+2019 RIGHT SINGLE QUOTATION
Re: Faulty ligatures in Adobe PhotoShop
Doug Ewell wrote: Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote: The bad part of it is that the ligated characters shown (in the second and third examples) seem to include a long "s" instead of an "f"... ty_06.gif attached for reference. Thanks for the report, I'll forward it to the Photoshop guys. By the way, the font is apparently Adobe Caslon Pro. Substituting an unligated ſi (U+017F + U+0069) for fi (U+0066 + U+0069) makes no sense at all. If the current font doesn't contain an fi ligature (U+FB01), Photoshop should just leave the combination alone. More likely, the image was created in Illustrator or some such, and the glyph was selected manually by the author. I did not check explicitly, but I am ready to bet a whole lot that the font does the correct thing. Eric.
Re: Unicode 4.0 is online at last!
Peter Kirk wrote: And indeed the software being used is produced by a consortium member. Perhaps the embarrassment should be more that member's, that their software is not Unicode compatible. The member in question is a company. Companies are not embarrassed nor ashamed. whilst implementing the full standard, including all scripts... Even the ones that would be newly defined in the next version... ;-) Actually, we don't need that much: there are very few strings set in other scripts, and for those, it would be quite acceptable to do the typesetting by hand - in fact, that's precisely what is done today. Eric.
Re: Does Unicode 3.1 take care of all characters of 'Hong Kong SupplimentaryCharacter Set - 2001' (HKSCS-2001) ?
John McConnell wrote: The mapping of the HKSCS 2001 repertoire to ISO/IEC 10646-2:2001 has 35 mapped to the private use area, 1651 mapped to supplementary plane 2, 511 mapped to the Extension A block (on the BMP), 2212 mapped to the CJK Ideographic block (also on the BMP), plus another 278 mapped elsewhere on the BMP. 35 + 1651 + 511 + 2212 + 278 = 4687. HKSCS 1999 has 4,702 characters, and HKSCS 2001 adds 116, for a total of 4818. I believe that the 131 unaccounted for in your decomposition are in the "mapped elsewhere" pile, which should be 409. Eric.
Re: From [b-hebrew] Variant forms of vav with holem
Mark Davis wrote: The UTC accepts and considers proposals from other parties (see http://www.unicode.org/pending/proposals.html for submitting a proposal for new characters). For complex matters (which this definitely seems to be, based on the volume of mail!), it is far and away the best if someone can attend the appropriate UTC meeting to explain the details of the proposal, with the pros and cons of different approaches. And be aware that not every UTC member will be up to speed at all. Personally, I have next to zero knowledge of Hebrew, and I did not read the 400 messages on the subject. I am willing to learn, but it is not going to happen without a self-contained proposal. I would also appreciate it if we can get pointers to background material on Biblical Hebrew that is readily available in the US (please make the subject line specific and new if you send that to [EMAIL PROTECTED]). Eric.
Re: U+23D0 VERTICAL LINE EXTENSION
Alan Wood wrote: I think this leaves only one character in the old Symbol font that does not have a Unicode equivalent: RADICAL EXTENDER (decimal 96 in the Windows version) When I prepared the proposal for U+23D0 VERTICAL LINE EXTENSION, it was indeed to ensure the complete representation of some other character set in Unicode. My target was actually the PUA usage defined by Adobe, which included what's needed for the Symbol font. I did not consider perfect round tripping a necessity: it was enough for me to allow the conversion of old data to Unicode, and to leave the old world behind. Nor do I consider having a perfect handling of symbol pieces in a Unicode only world a necessity: exchanging with somebody the plain text ...U+23B2 SUMMATION TOP U+23B3 SUMMATION BOTTOM ... does not improve one bit our communication over exchanging ...U+2211 N-ARY SUMMATION... Nor do I consider symbol pieces a good solution for typesetting (*glyphs* for the symbol pieces may be a good thing for that problem, but that requires more communication between a layout engine and a font than a mapping from characters to glyphs). For the RADICAL EXTENDER, I could not convince myself that such a character was needed; U+23AF HORIZONTAL LINE EXTENSION is a fine character to use for that purpose. U+23D0 VERTICAL LINE EXTENSION was much easier to justify (i.e. nothing else made sense) and there was the model of U+23AF HORIZONTAL LINE EXTENSION to build on. This represents only my opinion, and explains why I did not propose RADICAL EXTENDER. It says nothing about how the UTC would react to such a proposal. Eric.
Re: U+1D29
Anto'nio Martins-Tuva'lkin wrote: I've just downloaded the PDF files with 4.0 additions (U40-*.pdf). One question: How is one supposed to tell apart the glyphs for U+1D29 and U+1D18?... Or one isn't?... In the same way that you tell apart the glyphs for U+0050 P LATIN CAPITAL LETTER P and U+03A1 Ρ GREEK CAPITAL LETTER RHO? Eric.
Re: ISO 8859_2 and Windows 1250
Otto Stolz wrote: CP 1250 contains the ISO 8859-1 characters, hence it is not suited for Slavic languages. I suspect that Otto meant to type CP 1252 contains... Eric.
Re: Handwritten EURO sign
The latest issue of Baseline (www.baselinemagazine.com) has an article on the Euro. I did not read it, so I don't know if it speaks of handwritten forms. Sign of the times: the euro currency symbol by Conor Mangat. Eric.
Re: 4701
Michael Everson wrote: Happy New Year of the Yáng to everybody! (I can't work out whether it's the Year of the Sheep, the Goat, or the Ram.) Ram. Eric.
Re: urban legends just won't go away!
Barry Caplan wrote: Who knew in this day and age flipping bits to change case is still publishable (this is from today!) What I find a lot more objectionable is that what this code pretends to do is not defined (in particular, the domain over which it applies). Without such qualification, we cannot say if the code is correct or not, no matter how fishy it looks. In fact, this example is a perfectly valid implementation if the system pretends to handle only an appropriate subset of the Unicode character set. For more information, see http://www.cs.utexas.edu/users/EWD/. Eric.
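To make the point concrete, here is a minimal Python sketch (function name mine) of the classic bit-flip trick: toggling bit 0x20 swaps case correctly on ASCII letters and on nothing else. Restricted to that stated domain, the code is correct; presented without the qualification, it is an urban legend.

```python
def ascii_swapcase(ch):
    """Toggle case by flipping bit 0x20 -- valid only for ASCII letters.
    Anything outside A-Z / a-z (including all non-ASCII characters) is
    passed through unchanged: that is exactly the limited domain over
    which the bit-flip trick is a correct implementation."""
    code = ord(ch)
    if 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:
        return chr(code ^ 0x20)
    return ch
```

Outside the ASCII range the function deliberately does nothing; a full Unicode case mapping needs the case tables, not arithmetic.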
Re: Documenting in Tamil Computing
I don't understand what you meant by Unicode not being mature enough to support multilingual emails. Maybe the argument is simply that there are not enough email agents that can render Tamil properly from Unicode-encoded text, and that email rarely has a useful life long enough to justify the pain today. Eric.
Re: converting devanagari to mangal unicode
In order to convert any Devanagari font to be rendered in the same way, Maybe Sunil is just asking for a conversion of data, presumably from ISCII to Unicode. Eric.
TDIL information on Indic languages.
This may be of interest for people working with Indic languages. Eric. Original Message Subject: [li18nux:1096] Re: Linux Future Survey Date: Thu, 26 Sep 2002 16:40:36 +0530 From: "Dutta Abhijit" [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Hello We have been working to provide "standard" descriptions of various Indian languages for developers to use. The information is provided here: http://tdil.mit.gov.in/news.htm . See the newsletters for January and April 1. http://tdil.mit.gov.in/tdiljan2002.pdf (Sanskrit, Hindi, Marathi, Konkani, Sindhi, Nepali) 2. http://tdil.mit.gov.in/tdil-april-2002.pdf (Gujarati, Malayalam, Oriya, Gurmukhi and Telugu ) Any thoughts ? Regards, Abhijit
[OT] looking for electronic dictionaries
For my personal use, I would like to acquire electronic dictionaries, principally for the major European languages, with the following characteristics: - reputable source - raw datafiles accessible - I appreciate the interfaces that dictionary vendors may provide, but I want to be able to write my own code to find the data I am looking for - the wordlist is the principal aspect; I can live without definitions. - markup about the structure of words, for things like hyphenation, etc. (or from which hyphenation can be derived) - some form of frequency count would be nice For example, I'd like to compute something like: the average French character occupies x bytes in UTF-8, with average defined in sync with the frequency count. And I'd like to compute things like spelling changes introduced by hyphenation in Dutch. Any pointers? Thanks, Eric.
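The first computation mentioned above is straightforward once a frequency count is available; a sketch, with a toy frequency table standing in for real dictionary data (the counts below are hypothetical):

```python
def avg_utf8_bytes(freq):
    """Frequency-weighted average number of UTF-8 bytes per character.

    freq maps each character to its occurrence count; the average is
    taken 'in sync with the frequency count' as described above."""
    total_chars = sum(freq.values())
    total_bytes = sum(len(ch.encode('utf-8')) * n for ch, n in freq.items())
    return total_bytes / total_chars

# Hypothetical counts: 'e' is 1 byte in UTF-8, 'é' is 2 bytes.
sample = {'e': 10, 'é': 2}
```

With the toy table this yields 14 bytes over 12 characters; the interesting number is of course what a real French frequency list produces.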
Re: New version of TR29:
Your definition of LatinVowel is problematic. Is Y only a vowel in French? In a word such as yeux, it certainly is a consonant. Could this lead to problems? I don't think so, but I wait for the opinion of French speakers. What I can see is that things like l'yaourt [lja'ur] are normal in French spelling, and sometimes are to be found also in Italian (l'yoghurt ['ljogurt]). y is either a vowel or a semi-consonant. When a semi-consonant, an initial y does not cause elision, so le yaourt. Of course, there are exceptions: yeuse (oak), yèble (?) and yeux (eyes). The usage is both ways for yole (skiff). There are a few words starting with a vowel y: y (there), ypérite (mustard gas), ytterbium (?), yttrium (?). Finally, there is elision before most proper nouns starting with Y: Yonne (a river), York, etc. That being said, here are a few problematic cases for your proposal: prud'homme (a member of an industrial tribunal) is a single word, as are its relatives prud'homal and prud'homie. Grevisse (Le bon usage, the authority on French usage) gives five verbs which are considered a single word: entr'aimer (s'), entr'apercevoir, entr'appeler (s'), entr'avertir (s'), entr'égorger (s'); Le Petit Robert (1988, a well-respected dictionary) gives only the second one. There is elision before the names of the consonants f, h, l, m, n, r, s, x: admissible à l'X (accepted at X = École Polytechnique), devant l'n (before the n). grand'mère is definitely one word for me, but grand'rue, grand'chose are not so clear. All are archaic forms and Le Petit Robert does not list any of those (modern: grand-mère, rue principale, grand-chose). Then there is spoken French: j'suis allé m'promener for je suis allé me promener (I went for a walk). There are many such cases of elision before a consonant. This spoken French is of course very close to many dialects, or even close languages (e.g. Picard, spoken in the North of France).
Did we mention that one never breaks a line after an apostrophe that represents elision? Speaking of French line break problems, there is also the case of the ;, which takes a space before and after: foo ; bar. Of course, one never breaks on the space just after foo. Same for :. Eric.
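The cases above suggest that no purely mechanical rule can decide whether an apostrophe is word-internal; a segmenter needs both a clitic list and an exception list. A toy Python sketch of that idea (the word lists below are abbreviated and illustrative, not complete):

```python
# Common elided clitics: after these, the apostrophe marks a word break.
ELISION_CLITICS = {"l", "d", "j", "m", "n", "s", "t", "c", "qu", "jusqu"}

# Lexicalized single words that happen to contain an apostrophe
# (abbreviated; see Grevisse and Le Petit Robert for the real lists).
ONE_WORD_EXCEPTIONS = {"prud'homme", "prud'homal", "prud'homie",
                       "grand'mère", "entr'apercevoir"}

def split_elision(token):
    """Split an elided clitic from its host word, unless the whole
    token is a lexicalized single word containing an apostrophe."""
    if token.lower() in ONE_WORD_EXCEPTIONS:
        return [token]
    head, sep, tail = token.partition("'")
    if sep and head.lower() in ELISION_CLITICS and tail:
        return [head + "'", tail]
    return [token]
```

Even this heuristic ignores spoken-French elisions like j'suis / m'promener, which it handles only because j and m happen to be in the clitic list; the general problem remains dictionary-driven.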
OCR characters
In our OCR fonts, we have two glyphs named erase (looks like a black square) and grouperase (looks like a long dash). I don't have a copy of the OCR standards, but I suspect those are mandated by these standards. On the other hand, I can't find traces of those in Unicode, so I suspect they have been unified. But with which characters? More generally, are there other things like that we should be aware of? Thanks, Eric.
Re: Missing character glyph- example
John Hudson wrote: but it should *not* be encoded as U+ or as any other codepoint. .notdef should be unencoded. Almost. OpenType specifies that there is no functional difference between a code point that is not mapped and a code point that is explicitly mapped to GID 0, so there is never a need to map any code point to GID 0. But at the same time, there is no prohibition against mapping explicitly a code point to GID 0. Eric.
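In code terms, the equivalence described above is just a default value; a minimal sketch, with a plain dict standing in for a parsed 'cmap' subtable:

```python
def gid_for(cp, cmap):
    """Per the OpenType behavior described above: a code point absent
    from the cmap and one explicitly mapped to glyph 0 are functionally
    identical -- both resolve to GID 0, the .notdef glyph."""
    return cmap.get(cp, 0)
```

This is why an explicit mapping to GID 0 is never needed, yet also harmless.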
Re: [OpenType] library for identifying equivalent sequences
I don't have what you are looking for [canonically equivalent strings], but I am curious how you plan to go from that to: (The underlying issue is that I'm trying to figure out, given some precomposed glyph in a font, what are all the valid substitutions that could be applied in the smart-font code.) E.g., don't you also want the strings that contain a sprinkling of ZWJ, ZWNJ, CGJ, SHY and various other things? Eric.
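To illustrate how even the narrow problem is non-trivial: here is a toy Python sketch, using the standard unicodedata module, that enumerates just the canonical reorderings of a character's full decomposition. It assumes the NFD form is a single starter plus combining marks, and it ignores partial recompositions as well as the ZWJ/ZWNJ/CGJ/SHY sprinkling mentioned above:

```python
import unicodedata
from itertools import permutations

def equivalent_orderings(s):
    """Toy enumeration of canonically equivalent mark orderings.

    Assumption: the NFD form of s is one starter followed by combining
    marks. We permute the marks and keep each ordering whose NFD form
    matches; partial recompositions (NFC and mixed forms) are ignored."""
    nfd = unicodedata.normalize('NFD', s)
    base, marks = nfd[0], nfd[1:]
    out = set()
    for perm in permutations(marks):
        cand = base + ''.join(perm)
        if unicodedata.normalize('NFD', cand) == nfd:
            out.add(cand)
    return out
```

For U+1EC7 (e with circumflex and dot below) this already yields two orderings, because U+0323 (ccc 220) and U+0302 (ccc 230) commute under canonical reordering; the full set of equivalent strings, with partial compositions, is larger still.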
Re: ZWJ and Latin Ligatures
[EMAIL PROTECTED] wrote: This was one of the basic design criteria in order to ensure that support for a script could be added by building a font using tools accessible to people with less than C-programming skills and without requiring any rewrite of software. Actually, the goal of easily adding shaping for a new orthography and the goal of not duplicating in all fonts what really belongs to an orthography are not as incompatible as we paint them. For example, there could be a plug-in mechanism (and those plug-ins could be written in a special-purpose language) for Uniscribe. Eric.
Re: [OpenType] Proposal: Ligatures w/ ZWJ in OpenType
I just reread the Unicode 1.0 standard on ZWNJ and ZWJ (p77), and it seems very similar to the 3.2 explanation (although not as detailed). Am I correct in thinking that the intents are the same, except maybe for Indic scripts, or is there some other difference I did not spot? Eric.
Re: [OpenType] Proposal: Ligatures w/ ZWJ in OpenType
The mechanism proposed by John to handle ZWJ/ZWNJ makes the implicit assumption that those characters are transformed into glyphs (via the usual 'cmap' mechanism) and that this is the avenue to transfer the intent of those characters to the shaping code in the font (i.e. some kind of ligature lookup). I'd like to revisit that assumption. The ZWJ/ZWNJ characters are formatting characters. Their function is definitely different from the function of the "regular" characters (such as "A"): they are a way to control the rendering of regular characters around them, and to express that control in plain text. The debate so far shows that there is no strong objection to that mechanism by itself. In an environment richer than plain text, there is obviously the possibility that this control could be expressed by other means than characters. In the OpenType world, and in particular in the interface between the layout engine and the shaping code in fonts, we have more than plain text, or rather plain glyphs; we also have a description of which features should be applied to which glyphs. So instead of having glyphs that stand for ZWJ/ZWNJ, can we use these features? In fact, we already do that every day. For example, an InDesign user can insert the two characters x and y, and apply a ligature feature (let's say 'dlig') to them. It seems to me that this is just what ZWJ is about. So InDesign could do the following given the character sequence x ZWJ y: map it to the glyph sequence cmap(x) cmap(y), with 'dlig' applied on those two glyphs. This 'dlig' application takes precedence over one via UI, i.e. it happens regardless of whether the user requested 'dlig' explicitly. The ZWJ character is simply not mapped to the glyph stream, since the feature application does the job of ZWJ. We can handle ZWNJ in the same way: the sequence x ZWNJ y is transformed to the glyph sequence cmap(x) cmap(y), with 'dlig' not applied on those two glyphs.
This 'dlig' non-application takes precedence over one via UI, i.e. 'dlig' is not applied to these two glyphs regardless of whether the user requested 'dlig' explicitly. [Maybe a better way of thinking about the precedence stuff is to think entirely in markup terms: ligatures-on ... x ZWNJ y ... /ligatures-on is transformed into the glyph stream dlig ... cmap(x) /dlig dlig cmap(y) ... dlig, i.e. dlig is off on the pair x y; hold your objection that a feature is applied to a position rather than a range for a minute.] With this approach, we gain two things. First, not having a "formatting" glyph for ZWJ is IMHO a huge conceptual win, even bigger than not having a "formatting" character ZWJ would be. Second, what John's proposal did not mention (or maybe I missed it) is that it's not just the ligature features that have to deal with this glyph, it is all the features; compound that by all the formatting characters, and you will start to understand Paul's reaction. It's interesting to note that this approach can be applied to other formatting characters as well. Either their intent can be achieved by the layout engine alone, without help of the font, in which case there is no need to show anything to the code in the font; no glyph and no feature are a consequence of those characters. Or their intent needs help of the font, and the OpenType way to ask for this help is to apply (or not) features. All that takes care of selecting a ligature, but it does not quite take care of selecting cursive forms. I can see how we could define 'dlig' to do that (or define a 'zwj' feature that invokes the ligature lookups plus some single substitution lookup), but I am not sure I am happy with that. In fact, I am not sure I am happy with that clause in Unicode. Eric. [About the features applied to ranges rather than positions: think about it and it should be obvious 8-) It does not make sense to apply a ligature at a position; what makes sense is to apply a ligature on a range.
Think about 1-n substitutions; whatever lookups apply to the source glyph should also apply to all the replacement glyphs - ranges again. I even believe that this approach is compatible with the current OpenType spec. More details on demand.]
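A sketch of the character-to-glyph mapping proposed above (names and the feature representation are mine; a real engine would carry feature ranges rather than per-glyph flags, as the bracketed notes discuss):

```python
ZWJ, ZWNJ = '\u200d', '\u200c'

def chars_to_glyph_run(text, cmap):
    """Consume ZWJ/ZWNJ in the layout engine, per the proposal above:
    the formatting characters produce no glyphs; instead they force the
    'dlig' feature on (ZWJ) or off (ZWNJ) on the surrounding glyphs.

    Returns (glyphs, dlig_overrides), where each override is True
    (force on), False (force off), or None (follow the UI setting)."""
    glyphs, overrides = [], []
    pending = None  # override carried forward from a ZWJ/ZWNJ just seen
    for ch in text:
        if ch in (ZWJ, ZWNJ):
            on = (ch == ZWJ)
            if overrides:
                overrides[-1] = on  # applies to the preceding glyph...
            pending = on            # ...and to the following one
            continue
        glyphs.append(cmap[ch])
        overrides.append(pending)
        pending = None
    return glyphs, overrides
```

Note that the override takes precedence over the UI in both directions, matching the precedence rules stated above for 'dlig' application and non-application.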
Re: Acrobat, Unicode, Advanced usage
Greenwood, Timothy wrote: This question is pertinent to one asked of me the other day for which I did not have an answer. Is the code set of an original document - say EUC, SJIS - relevant for PDF? Will the output perform text searches correctly for differing code set inputs? PDF documents logically contain two streams: one of characters, and one of glyphs. The glyph stream is always present physically, and is used for rendering. Depending on the fonts involved, the PDF generator, and all sorts of factors, the meaning of the numbers in that glyph stream, and the machinery to locate the actual outlines, will vary quite a bit. The character stream can be represented explicitly, in which case I am pretty sure it is always a Unicode stream. Alternatively, it can be computed from the glyph stream using various mechanisms; I believe that all the computations described in the PDF spec generate a Unicode stream. The choice of explicit vs implicit character representation is up to the PDF producer. In all cases, I believe that the producer has the responsibility of converting from whatever character standard is used in the original document to Unicode. When the producer is Distiller, it may not have access to the original character content and be forced to create an approximation. Eric.
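A toy model of the two streams may make this concrete (the glyph IDs and the ToUnicode-style map below are entirely hypothetical):

```python
def extract_text(glyph_stream, to_unicode):
    """Recover a character stream from the glyph stream, roughly the way
    a PDF consumer does for search/copy. Glyphs with no mapping come
    out as U+FFFD, an approximation of the 'forced approximation' case."""
    return ''.join(to_unicode.get(gid, '\ufffd') for gid in glyph_stream)

# Hypothetical mapping: GID 42 is an 'fi' ligature glyph, so a single
# glyph yields two characters in the recovered character stream.
TO_UNICODE = {17: 'f', 18: 'i', 42: 'fi'}
```

The point of the sketch: search works on the recovered character stream, so it is the quality of this mapping, not the code set of the original document, that determines whether searches succeed.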
What does Z variant mean for Han?
In the description of the Han script (section 10.1), the Z axis is described by: The actual shape (typeface) attribute (the Z axis) is for differences of type design (the actual shape used in imaging) of each variant form. Let's take the concrete example of U+5516 and U+555E, with kZVariant entries pointing at each other in Unihan.txt. Does Z variant mean that all the glyphs which are acceptable to represent U+5516 are also acceptable to represent U+555E, and conversely? Of course, some shapes may be more appropriate in some circumstances, much like a Fraktur shape may be more appropriate than a Roman shape in some circumstances. Does it go as far as making folding from one character into the other a useful operation, assuming one is not interested in perfect round-tripping with other character standards? If a document that contains U+5516 is rendered, and one does a copy/paste from a rendering of that document, is it acceptable to paste U+555E? Are the answers to those questions different for other pairs of characters that are Z variants of each other? Thanks, Eric.
Re: Hexadecimal characters.
For the scripts which have their own digits, are there conventions for writing hexadecimal numbers with those digits? If I read a Devanagari textbook, will I see 20A7, or २०?७ (where ? stands for whatever is used for A)? Thanks, Eric.
Re: Encoding of symbols, and a lock/unlock pre-proposal
Markus Scherer wrote: They had practical uses when user interfaces and display systems could not handle icons and arbitrary images, but those times are long over. I wish this were the case, but most if not all systems insist that graphics stored in a font be accessed as characters. This puts pressure on encoding symbols. Fonts as packages of graphics are unequaled in some respects:
- they support a unique combination of geometric and structural information (hinting); the latter is vital at low rendering resolutions and gives the designer a say in what can be dropped and what should be preserved. Yes, the rendering resolution of a given system continually improves (e.g. printers, desktop screens), but at the same time we keep inventing new classes of devices (e.g. PDAs, phones), where cost constraints put us back at low resolution.
- they offer a convenient way to put multiple, usually related, images in one package.
- the publishing industry has figured out how to manage them (not that it's easy or that they got a lot of help...).
Of course, they have many limitations (e.g. monochrome, no transparency, etc.). Nevertheless, it is useful to package images for symbols in fonts. The problem is that essentially all systems insist that characters be used to access the content of a font. I don't know of any system where I can specify a graphic as "this glyph in this font", on the same terms as I can specify "this .gif". Eric.
Re: Greek Extended: question: missing glyphs?
David J. Perry wrote: If I were to make a complete OT Greek font, with all the above as well as the combinations already in Unicode, which would provide better performance: substitutions or positioning via OT features? There is a similar thread on the [EMAIL PROTECTED] mailing list. My argument is that you cannot expect positioning to be more complete because it will form some combinations with apparently less work. Regardless of whether you use substitution or positioning, it's the testing of the actual combinations, alone and in context, that is the bottleneck. The differences between the two methods (which can be characterized as combination at the factory vs. combination at the installation site) are second-order factors. In the case of CFF outlines, which have a notion of subroutine, the size difference is not that big. What is going to drive your choice is the ease of creating combination glyphs vs. the ease of creating GPOS lookups in your font development environment, and how much you are willing to depend on Unicode/OpenType support in the target environment (I know some software that is more likely to handle substitutions than mark positioning). Eric.