[OT?] Uniscribe for Malayalam and Oriya
Hallo everybody. I hope this is not (too) OT. In case it is, can somebody please redirect me to a more appropriate forum? I am trying to display Malayalam and Oriya on an MS Win2000 system. The fonts I am using are OpenType and have all glyphs and tables needed by these scripts, but I still don't see the glyphs come out correctly. As not even *reordering* is done, I guess that my Uniscribe DLL does not support these scripts. Are they implemented in newer versions of Uniscribe? If yes, where can I get it? Thanks in advance for any help. -- Marco Cimarosti
[OT] The nice thing about standards...
Hallo everybody! I received this in the mail, and I thought it could be of interest for the Unicode mailing list: Aragonese - Lo geno d'as normas ye que aiga tantas entre ras que se puede eslexir. Asturian - Lo bono de les normes ye qu'hai munches onde escoyer. Basque - Arauen alderik onena da aukeratzeko horrenbeste izatea. Brazilian Portuguese - A coisa boa dos standards é a sua incrível variedade. Bresciano - El bl de gli standar l' che ghe n' tach. Breton - Ar pep gwella a-fed ar standardo eo o liested. Calabrese - La cosa bella tra li standard è ca ci ni su tanti ca putimu scegli. Catalan - El bo de les regles és que n'hi tantes entre les quals escollir. Croatian - Zgodno u standardima jest da ih je toliko mnogo na izbor. Dutch - Het leuke van normen is dat er eindeloos veel zijn om uit te kiezen. English - The nice thing about standards is that there are so many of them to choose from. Esperanto - Bona afero pri normoj estas ke ekzistas tiom da ili por elekti. Estonian - Normide juures on kõige toredam see, et neid on nii palju - ole mees ja vali! Finnish - Standardien hyvä puoli on siinä, että niissä on niin paljon valinnanvaraa. Flemish - Het leuke aan normen is dat je zo'n uitgebreid assortiment hebt om uit te kiezen. French - La plus grande qualité d'un standard est sa multiplicité. Friulian - La robe biele dai standard j che 'a son tancj par sielgiju. Galician - O bo das normas é que hai moitas para escoller. German - Das Beste an Standards ist ihre große Auswahl. Griko - To prama rrio attus standard ene ka echi tossu amsa is pu sozzi jaddhzzi. Hungarian - Az a jó dolog a szabványokban, hogy olyan sok közül lehet választani. Italian - La cosa bella degli standard è che ce ne sono tanti tra cui scegliere. Judeo-Spanish - Lo bueno de las normas es ke son tantas. Ke se puede eskojer entre eyas. Latin - Bonum rationum est quod plurimae sunt inter quas eligere possumus. Limburguese - Wo zoe hnnig s on norme s daste ter zovel van hbs vr aut te kieze. 
Modenese - Al bl di mod l' c'a gh 'na bla slta. Napolitan - 'O bbello r''e mudille è ca se ponno scegliere mmiez'a ttante. Papiamento - E kos bon di normanan ta ku tin hopi fo'i kua por skohe. Parmisan - Al bel di standard l' ch' a gh'n na mucha pr poder cernir. Piedmontese - Ln ch'a l' bel d ij stndard a l' ch'as peul serne antre vaire. Polish - Zaletą standardów jest to, że będąc licznymi, jest z czego wybierać. Portuguese - O bom das normas é que há moitas por onde escoller. Romagnolo - Quel che da gost t'le banl l' ch'u i un foto. Romano - Er bello delle regole è che ce ne so' 'n sacco tra cui pote' sceje. Sicilian - 'U bellu d' 'i standard è ca ci nn'è un mari ammenzu a cu' scegliri. Spanish - Lo bueno de las normas es que hay tantas entre las que se puede elegir. Swedish - Det trevliga med standarder är att det finns så många att välja på. Venetian - El belo dei 'standard' xe che ghen'è un saco par far na selta. (More here, of course in Unicode: http://www.verba-volant.net/pls/vvolant/datecorps2.dictionary?data=20-OCT-04 ) -- Marco
RE: UTF to unicode conversion
Mike Ayers wrote: Side 1 (print and cut out): ++---+---+--+ | U+ | yy zz |Cima's UTF-8 Magic | Hex= | | U+007F | ! ! |Pocket Encoder | B-4 | | YZ | . . | | | ++---+---+ Vers. 1.0 | 0=00 | | U+0080 | 3x xy | 2y zz | 16 March 2000 | 1=01 | [...] Holy Hermes Trismegistos, I had forgotten this one! How's it I had all that free time back in March 2000? :-) _ Marco
RE: lines 05-08, version 4.7 of Roadmap to BMP and 'Hebrew extensions'
Rick McGowan wrote: I mistakenly thought Tifinagh was rtl. That's OK. It has been, and sometimes still is, written right to left, hence it was roadmapped in a right-to-left allocation block. However, in modern usage, and in the Moroccan national standard now being drafted, it is specifically left to right. Is the draft of this Moroccan standard on-line somewhere? TIA. _ Marco
RE: Bob Bemer, father of ASCII, has died
[\]{}
RE: Latin long vowels
Anto'nio Martins-Tuva'lkin wrote: On 2004.06.22, 16:20, Marco Cimarosti wrote: You can also compose them with the normal letter followed by character MODIFIER LETTER MACRON (code 02C9, decimal 713). Oops! You mean U+0304 : COMBINING MACRON (decimal: 772). Yes, right, sorry. (Hey, that's why it didn't want to go on top of the bloody vowel! :-) _ Marco
RE: Latin long vowels
Joe Speroni wrote: I apologize for a simple question, but after a few hours of research I don't seem to be able to find the characters needed. Funny: I see them in my Windows Character Map utility at the first hit on the Page Down key... I'm trying to scan a Latin text that uses a bar over the vowels to indicate long sounds. Do these characters exist in Unicode? Uppercase: 0100, 0112, 012A, 014C, 016A (decimal: 256, 274, 298, 332, 362) Lowercase: 0101, 0113, 012B, 014D, 016B (decimal: 257, 275, 299, 333, 363) You can also compose them with the normal letter followed by character MODIFIER LETTER MACRON (code 02C9, decimal 713). If so, would anyone know from where a Windows XP font containing these five characters could be downloaded? Several fonts which come pre-installed in Windows NT, 2000 or XP have those characters, e.g. Arial, Times New Roman and Courier. Aloha, Aloha? Is it for Hawaiian that you need the macrons? In that case also notice that, afaik, the proper character for the glottal stop letter (aka apostrophe or okina) is MODIFIER LETTER TURNED COMMA (code 02BB, decimal 699). _ Marco
RE: [OT] Even viruses are now i18n!
Antoine Leca wrote: The virus cannot have any knowledge of a language code. And much less of the language used by its next victim... It sends e-mails to addresses stolen from the previous victim's address list, so it can analyze the top-level domain of these addresses (.it, .fr, etc.). Although, strictly speaking, these domains normally correspond to *country* codes, they are a pretty good hint of the language of the next victim. _ Marco
[OT] Even viruses are now i18n!
It seems that even the virus industry is getting global! F-Secure Virus Descriptions : NetSky.X [...] Netsky.X sends messages in several different languages: English, Swedish, Finnish, Polish, Norwegian, Portuguese, Italian, French, German and possibly the language of some small island called Turks and Caicos, located in the Atlantic Ocean. [...] According to whether the destination address is one of the following domains: [...] It will compose messages in the corresponding language, choosing from the following parts. Bodies chosen from: [...] mutlu etmek okumak belgili tanimlik belge. Behaga läsa dokumenten. Haluta kuulua dokumentoida. Podobac sie przeczytac ten udokumentowac. Behage lese dokumentet. Leia por favor o original. Legga prego il documento. Veuillez lire le document. Bitte lesen Sie das Dokument. Please read the document. (from http://www.f-secure.com/v-descs/netsky_x.shtml) _ Marco
RE: [OT] Even viruses are now i18n!
Peter Kirk wrote: mutlu etmek okumak belgili tanimlik belge. ... This is Turkish, of a sort. The virus writers have presumably confused .tc and .tr, as this Turkish is the first body listed and .tc is the first domain listed. Yes, and the translation was probably done by translating the English sentence "Please read the document" word by word. The English word "the" was translated as "belgili tanımlık", which actually means "definite article"! _ Marco
RE: help finding radical/stroke index at unicode.org
Gary P. Grosso wrote: Judging by what we saw in the back of the Unicode 2.0 book, we would tend to say that it is correct that (in an index) 21333 (0x5355) is sorting under 21313 (0x5341) instead of 20843 (0x516b). I am looking for some table of radicals that I can show our customer to help support that claim. Perhaps I should start by asking for opinions on the above sorting, and for guidelines on how best to govern such decisions, [...] As Ken Whistler said, you don't necessarily have to make such a decision. The usual policy in a dictionary-like radical/stroke index is to put ambiguous characters under *multiple* radicals, so that they are easily found whatever the reader's assumption. A computer radical/stroke search utility is supposed to be at least as user-friendly as old paper indices. Please notice that the Unihan.txt database contains most of the raw data you need to build such a comprehensive index. The data is contained in these fields: kRSUnicode kRSJapanese kRSKanWa kRSKangXi kRSKorean kRSUnicode is Unicode's default radical/stroke index (the one which was used to assign the code points to CJK characters), while the other ones are alternate radical/stroke indices from a variety of sources. E.g., for character U+5355 (单 = lone), Unihan.txt contains the following kRS... entries: U+5355 kRSKangXi 12.6 U+5355 kRSUnicode 24.6 These entries tell you that while Unicode puts U+5355 under the 24th radical (U+2F17 = U+5341 = 十 = ten), the Kang Xi Zidian dictionary puts it under the 12th radical (U+2F0B = U+516B = 八 = eight). Basically, if you extract all the kRS... fields, ignore the field identifier, sort them by their radical/stroke index, and discard duplicates (i.e., entries with the same index and code point), you obtain a list quite close to a dictionary-like index with ambiguous characters under multiple radicals. _ Marco
RE: Unicode 4.0.1 Released
Rick McGowan wrote: Unicode 4.0.1 has been released! [...] The main new features in Unicode 4.0.1 are the following: [...] 3. Unicode Character Database: [...] * Changed: general category of U+200B ZERO WIDTH SPACE * Changed: bidi class of several characters (If I am asking a FAQ, I apologize in advance...) So far, my understanding was that the normative properties of existing code points were carved in stone. Won't these fixes break applications out there? I.e., won't they turn previously conformant applications into non-conformant ones? _ Marco
RE: help needed with adding new character
Michael Everson wrote: What organization uses the ANARCHY SYMBOL? ;-) The anarchist movement. Why are you winking? Ciao. Marco
[OT] Freedom and organization (was RE: help needed with adding new character)
Kenneth Whistler wrote: Why is an Anarchist asking to standardize something? Why not!? Can you elaborate on this? Myself, I am an anarchist sympathizer, and I have been deeply interested in a character encoding standard for nearly ten years now... Anarchism is against imposed forms of organization, not against organization itself. And standards are quite like the useful side of laws (the organization) without the harmful side (the imposition), so they should be welcome to anarchists. Obedience to laws can only be imposed with violence by a parasitical clique (cops, tribunals), whereas compliance to standards is achieved solely by the intrinsic usefulness and quality of the standard's design. _ Marco
RE: help needed with adding new character
Jon Wilson wrote: I disagree that the anarchy symbol is not a character used in the representation of words. I can write a word beginning with A with either a simple LATIN CAPITAL LETTER A, or with an Anarchy symbol, or with an existing CIRCLED LATIN CAPITAL LETTER A. You can also write an M using the McDonald's logo (I have seen it in actual McDonald's advertising), or an I using a jumping table lamp (I have seen it in a cartoon by Pixar). But we don't need to allocate Unicode code points for the McDonald's logo or for a jumping table lamp, do we? Such things are not independent letters, but just graphic variations of the ordinary letters A, M, I, made with the purpose of adding some kind of overtone (ideological, commercial, humorous). I also disagree that the Anarchy symbol has no use within a text. I do not doubt that I can find examples of published texts where the anarchy symbol is used throughout. Beware of saying that it isn't real text just because the character isn't currently in Unicode. The code should represent usage, not the other way round. I understand that finding such text is probably crucial to a successful application. Yes, I think that this is THE very point that you have to demonstrate before filing your proposal. And I bet that the success of your proposal will depend almost entirely on how good this demonstration is. The point is not so much to demonstrate that the symbol exists (that's quite obvious: we've all seen it), or that it is unique (that's quite obvious too, IMHO: it is both graphically and semantically different from the current Unicode circled A): the point is to demonstrate that it is *TEXT*, i.e. that some piece of text could not be encoded without it. I am a subscriber of at least two anarchist magazines (printed on *paper*, so Unicode digital encodings are not at issue here), and I don't recall having ever seen an anarchy symbol used *within* the body text of these magazines. 
For sure, the symbol is ubiquitous *near* the text: it is used as a logo on the magazines' title or in third parties' advertising; it is used as typographic decoration; it appears on the flags in the photos of rallies and demonstrations... But, as far as I can recall, it is never used as part of a sentence. Of course, knowing about your proposal, I will look with doubled attention at all the next issues, and I will send you any specimen of the symbol used as text. But I am quite skeptical. _ Marco
RE: [OT] Freedom and organization (was RE: help needed with adding new character)
Peter Kirk wrote: Come to think of it, a not very large group of them with a bit of money behind them could buy enough votes to outvote the corporations and destroy Unicode - Yes, right, interesting possibility! Not that much money either: a single punk rock concert would probably raise enough funds. But, before I propose this to the World Wide Anarchist Consortium, could you please mention at least one valid motive for the action? They would object that, as anarchists are by nature anti-nationalistic and internationalist, anarchist on-line forums and web sites tend to be multi-language, and Unicode is quite a useful tool for that. _ Marco
RE: OT? Languages with letters that always take diacriticals
Curtis Clark wrote: Are there any languages that use letters with diacriticals, but *never* use the base letter without diacriticals? AFAIK, Thaana is such a case. Unlike Indic scripts, Thaana has no inherent vowel, so each consonant letter always takes either a vowel mark or the sukuun (= no-vowel mark). _ Marco
RE: Web Form: Other Question: Etruscan, Sanscrit, Linear B on ibook G4
John Jenkins wrote: Anybody understand what he means by there is unicode gamma of characters but it is not complete? I guess unicode gamma of characters is Italinglish for Unicode character set. (Italian gamma means repertoire, range, scale, set.) _ Marco
RE: Unicode forms for internal storage - BOCU-1 speed
Jon Hanna wrote: I refuse to rename my UTF-81920! Doug, Shlomi, there's a new one out there! Jon, would you mind describing it? _ Marco
RE: Detecting encoding in Plain text
Peter Kirk wrote: This one also looks dangerous. What do you mean by dangerous? This is a heuristic algorithm, so it is only supposed to work always but only in some lucky cases. If lucky cases average to, say, 20% or less, then it is a bad and useless algorithm; if they average to, say, 80% or more, then it is a good and useful one. But you can't ask that it works in 100% of cases, or it wouldn't be heuristic anymore. Some scripts include their own digits and punctuation; not all scripts use spaces; and controls are not necessarily used, if U+2028 LINE SEPARATOR is used for new lines. Yes, but *all* these circumstances must occur together in order for the algorithm to be totally useless for *that* language. If a certain Unicode plain-text file uses ASCII punctuation OR spaces OR end-of-line characters, AND the file is not too short and does not have a very odd formatting, then the algorithm should work. But there may be some characters U+??00 which are used rather commonly in a particular script and so occur commonly in some text files. And those text files will not be detected correctly, particularly if they are very short: that's part of the game. _ Marco
RE: Detecting encoding in Plain text
Jon Hanna wrote: False positives can be caused by the use of U+0000 (which is most often encoded as 0x00) which some applications do use in text files. I have never seen such a thing, can you give an example? I can't imagine any use for a NULL in a file apart from terminating records or strings but, of course, a file containing records or strings is not what I would call a plain-text file, anyway not a typical plain-text file. The method can be used reliably with text files that are guaranteed to contain large amounts of Latin-1 But the Latin-1 (or even just ASCII) range contains some characters which are shared by most languages (space, new line and/or line feed, digits, punctuation), so there should be a relatively large amount of Latin-1 characters in most cases. Even scripts which have their own digits or punctuation often prefer European digits and punctuation, especially in computer usage. E.g., it suffices to check a few websites (or even printed matter) in Arabic to see that European digits are much more widespread than native digits. _ Marco
RE: Detecting encoding in Plain text
Peter Kirk wrote: What do you mean by dangerous? This is a heuristic algorithm, so it is only supposed to work always [...] (I meant: it is not supposed to work always) I would not consider an 80% algorithm to be very good - depending on the circumstances etc. But if for example 20% of my incoming e-mails were detected with the wrong encoding and appeared on my screen as junk, [...] In this case (as in most other similar cases), you should rather blame the people who send you e-mail without an encoding declaration. Auto-detection should be the last resort, when you have no safer way of determining the encoding. Yes, but *all* these circumstances must occur together in order for the algorithm to be totally useless for *that* language. [...] True. But there may be certain languages (perhaps Thai?) for which all of these circumstances regularly occur together. I don't think that Thai would be such a case. Thai normally uses European digits (the usage scope of Thai digits is probably similar to that of Roman numerals in Western languages), some European punctuation (parentheses, exclamation marks, hyphens, quotes), and spaces (although a Thai space has the strength -- and hence the frequency -- of a Western semicolon). As a minimum, all languages should use line feed and/or new line as line terminators, as Unicode's line and paragraph separators never caught on. _ Marco
RE: Chinese rod numerals
Christopher Cullen wrote: (2) The Unicode home page says: The Unicode Standard defines codes for characters used in all the major languages [...] mathematical symbols, technical symbols, [...]. I suggest that in an enterprise so universal and cross-cultural as Unicode, the definition of what counts as a mathematical symbol has to be conditioned by actual mathematical practice in the culture whose script is being encoded. I think that Ken Whistler's point was simply this: OK, Chinese rod numerals may be symbols, but were these symbols used in *writing*? Not all symbols are used in writing, and only symbols used in writing are suitable to be part of a repertoire for, well, encoding symbols used in writing... A flag, a medal, a tattoo, a T-shirt may definitely be called symbols, yet Unicode does not need a code point for Union Jack or Che Guevara T-Shirt. To stick to mathematics, a pellet on an abacus, a key on an electronic calculator, or a curve drawn on a whiteboard may legitimately be considered symbols for numbers or other mathematical concepts. Yet, Unicode does not need a code point for abacus pellet, or memory recall key, or hyperbola with horizontal axis, because these symbols are not elements of writing. IMHO, in your proposal you should provide evidence that the answer to the above question is yes. I.e., you don't need to prove that these symbols were used in Chinese mathematics, but rather that they were used to *write* something (numbers, arguably, or arithmetical operations, etc.). _ Marco
RE: Detecting encoding in Plain text
Doug Ewell wrote: In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was UTF-16, Microsoft documentation suggested a heuristic of checking every other byte to see if it was zero, which of course would only work for Latin-1 text encoded in UTF-16. I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that this method was suggested first by Microsoft: to me, it seems quite self-evident. It is extremely unlikely that a text file encoded in any single- or multi-byte encoding (including UTF-8) would contain a zero byte, so the presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or UTF-32. The next step is distinguishing between UTF-16 and UTF-32. A bullet-proof negative heuristic for UTF-32 is that a text file *cannot* be UTF-32 unless at least 1/4 of its bytes are zero. A positive heuristic for UTF-32 is detecting sequences of two consecutive zero bytes, the first of which has an odd index: as it is very unlikely that a UTF-16 file would contain a NULL character, zero 16-bit words must be part of a UTF-32 character. The combination of these two methods is quite enough to tell apart UTF-16 and UTF-32. Once you have determined whether the file is in UTF-16 or in UTF-32, a statistical analysis of the *indexes* of zero bytes should be quite enough to determine the UTF's endianness. UTF-16 is likely to be little-endian if zero bytes are more frequent at odd indexes than at even indexes, and vice versa. This is due to the fact that, in any language, shared characters in the Latin-1 range (controls, space, digits, punctuation, etc.), whose high byte is zero, should be more frequent than occasional code points of form U+??00. 
For UTF-32, determining endianness is even simpler: if *all* bytes whose index is divisible by 4 are zero, then it is big-endian (that byte is the always-zero most significant one), else it is little-endian. Of course, all this works only if the basic assumption that the file is a plain-text file is true: this method is not quite enough for telling apart text files from binary files. _ Marco
RE: Punched tape (was: Re: American English translation of character names)
Anto'nio Martins-Tuva'lkin wrote: |O OoOO | |O oOOO | | OOo O O| |OO oOO | | O o OO| |OO o| |O Oo OO | |O o| | OOo OOO| |O OoO | | OOo OOO| |O o OOO| «\N5l#oVO7X7G»? _ Marco
RE: Unicode-ASCII approximate conversion
Hallvard B Furuseth wrote: I need a function which converts Latin Unicode characters to the closest equivalent ASCII characters, e.g. é -> e. Before I reinvent the wheel, does any public domain or GPL code for this already exist? I don't know, sorry. If not, for the most part I expect I can make the mapping from the character names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE' in ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt. Why the name!? The decomposition property (the 6th field on each line) is much better for this. E.g.: 00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9 The decomposition field tells you that é (code 00E9 hex) is composed of ASCII e (code 0065 hex) and the combining acute accent (code 0301 hex): you keep the ASCII character and drop the combining accent. Punctuation and other non-letters will be worse, but they are less important to me anyway. The result is much better if you allow the ASCII conversion to be a string. This allows you to map, e.g., © = (c), ½ = 1/2, and so on. This is also good for letters: ß = ss, å = aa, etc. _ Marco
RE: American English translation of character names
John Cowan wrote: In the New York City subway system (of underground trains, that is, not underground pedestrian tunnels!), this letter has been consistently avoided since 1967, when the system of distinguishing trains by letter or number was instituted. The only other letters never used are I and O (presumably to avoid confusion with 1 and 0, though 0 has never been used either), and Y. Why Y is a mystery to me: perhaps there has simply never been a need for it. Probably, having to get train Why? to reach one's workplace could have a negative effect on employees' attitude towards hard working. _ Marco
RE: [OT] CJK - CJC (Re: Corea?)
Doug Ewell wrote: I'll go farther than that. It's always bothered me that speakers of European languages, including English but especially French, have seen fit to rename the cities and internal subdivisions of other countries. Rightly said! There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. _ Marco
RE: [OT] CJK - CJC (Re: Corea?)
Michael Everson wrote: At 11:04 +0100 2003-12-17, Marco Cimarosti wrote: There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. Nicely said. Subtle irony tends to go over some people's heads on this list though. Especially if one forgets an essential no. :-( It should have been There is NO reason to rename... Eboraco is called Eabhrac in Irish. :-) So, that's who set the bad example in the first place! When the Angles came they said: if Britanni can mangle place names, why shouldn't Ingevones? :-) Ciao. Marco
RE: Arabic Presentation Forms-A
Philippe Verdy wrote: #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?; # RIAL SIGN fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?; The Arial Unicode MS font does not have a glyph for the Rial currency sign so I won't comment lots about it, even if it's a special ligature of its component letters: - where the medial form of U+06CC ARABIC LETTER FARSI YEH (?) is shown on charts only as two dots (and not with its Arabic letter alef maksura base form, as the comment in the Arabic chart suggests for Arabic letter yeh), which is I am not sure I understand what you are asking, but it is quite normal that the initial and medial forms of the letters Beh, Teh, Theh, Noon and Yeh lose their tooth and are thus recognizable only by their dots. Similarly, Seen and Sheen often lose their three teeth. I find this particularly puzzling with the initial and medial forms of Seen, which becomes a simple straight line in most calligraphic styles. - located on below-left of the medial form of U+0627 (?) , U+0627 is Alif, so it has no medial form. - and where the initial form of U+0631 (?) kerns below its next two characters (sometimes with an additional kashida below its next three characters). This too is quite normal: the tail of Reh, Zain and Waw often kerns below the next letter. Compare it to Latin lowercase j, which has a similar behavior. _ Marco
RE: [OT] CJK - CJC (Re: Corea?)
Doug Ewell wrote: This seems very misguided, if true. Alphabetical primacy can hardly be considered an effective measure of the relative power or importance of a nation. [...] Remember that in the time frame in question, the late '30s and early '40s, three of the major world powers were the United States, the United Kingdom, and the Soviet Union. These countries, beginning with U, U, and S in their respective national languages, were unlikely to attach much significance to the relative alphabetical order of Japan and Korea. Right. And what about the Chinese who, back in the 1950's, decided to use the digraph zh for the first consonant of the country's name? It seems that they didn't care too much that Zhongguo would have been last, after Zimbabwe. On the other hand, both Choson and Han'guk come before Nippon so, if the goal is to have the Koreas listed before Japan on the score boards of the Olympic games, it is enough to use the local names. But, BTW, aren't score boards sorted by score rather than by name? So, an even simpler solution is... winning more medals. _ Marco
[OT?] The C standard library and UTF's (was RE: Text Editors and Canonical Equivalence (was Coloured diacritics))
Tim Greenwood wrote: In my interpretation of the C standard (which I am reading from http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.pdf) UTF-8 is not a valid wchar_t encoding if your execution character set contains characters outside the C0 controls and Basic Latin range, and UTF-16 is not a valid wchar_t encoding if your execution character set has characters outside the BMP. In other words, whatever you consider to be a character (which may be a combining character) must be encoded in one wchar_t code unit. The relevant passage is: 11 A wide character constant has type wchar_t, an integer type defined in the <stddef.h> header. The value of a wide character constant containing a single multibyte character that maps to a member of the extended execution character set is the wide character (code) corresponding to that multibyte character, as defined by the mbtowc function, with an implementation-defined current locale. The value of a wide character constant containing more than one multibyte character, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined. I don't know. I thought a bit about this, and I think that your restrictive interpretation is not necessarily correct. After all, the C Standard just says that a wide character and a multibyte character are whatever the mbtowc function defines them to be. 
And it is quite easy to show that the mbtowc function could, in turn, define them to be whatever the mbrtowc function defines them to be:

// My hypothetical mbtowc.c
#include <wchar.h>

// (See ISO/IEC 9899:1999 - 7.20.7.2 The mbtowc function)
int mbtowc (wchar_t * pwc, const char * s, size_t n)
{
    int retval;
    static mbstate_t internal;
    if (s == NULL)
    {
        // yes: we are stateful (or pretend we are)
        return 1;
    }
    retval = (int)mbrtowc(pwc, s, n, &internal);
    if (retval < 0)
    {
        retval = -1;
    }
    return retval;
}

As the definition of multibyte characters and wide characters is now completely up to mbrtowc, we could well adopt the convention (or call it a trick, if you prefer) of pretending that a 4-byte UTF-8 multibyte sequence is actually a sequence of *two* 2-byte multibyte sequences. Technically, the trick is possible because: a) returning 2 twice instead of 4 once guarantees the correct advance while scanning a string; b) we can actually map both our fake 2-byte multibyte sequences to an actual wide character: the high and low surrogates; c) the mbstate_t object can be used to store the relevant data across the two calls. Legally, the trick is possible because of the purposely vague wording of the C Standard, which leaves the definition of wide and multibyte characters completely up to the implementation. Here is what I mean:

// Excerpt from my hypothetical wchar.h for UTF-16 wide characters
// ...
// (See ISO/IEC 9899:1999 - 7.17 Common definitions <stddef.h>)
typedef short wchar_t;
// ...
// (See ISO/IEC 9899:1999 - 7.24 Extended multibyte and wide character utilities <wchar.h>)
typedef wchar_t mbstate_t;
// ... 
// My hypothetical mbrtowc.c for UTF-16 wide characters
#include <wchar.h>

// (See ISO/IEC 9899:1999 - 7.24.6.3.2 The mbrtowc function)
size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
{
    extern int _MyDecodeUtf8 (const char * s, size_t n, long * c32);
    extern void _MyEncodeUtf16 (long c32, wchar_t * hi16, wchar_t * lo16);
    static mbstate_t internal = 0;
    long c32;
    int retval;
    if (ps == NULL)
    {
        ps = &internal;
    }
    if (s == NULL)
    {
        pwc = NULL;
        s = "";
        n = 1;
    }
    if (*ps != 0)
    {
        if (pwc != NULL)
        {
            // output second surrogate saved in previous call
            *pwc = *ps;
        }
        // clear saved surrogate
        *ps = 0;
        // return fake multibyte length
        return 2;
    }
    retval = _MyDecodeUtf8(s, n, &c32);
    if (retval == 4)
    {
        // output first surrogate and save second surrogate for next call
        _MyEncodeUtf16(c32, pwc, ps);
        // return fake multibyte length
        retval = 2;
    }
    else if (retval >= 0 && pwc != NULL)
    {
        *pwc = (wchar_t)c32;
    }
    return retval;
}

If the above UTF-16 implementation could perhaps look relatively smart, a UTF-8 implementation would definitely look very silly. However, if we agree that defining what a multibyte character and a wide character are is the exclusive task of mbtowc (and hence of mbrtowc), then the below implementation, silly as it is, could well be 100% compliant with C99:

// Excerpt from my hypothetical wchar.h for UTF-8 (or DBCS, or SBCS, or any byte-oriented encoding) wide characters
// ...
// (See ISO/IEC 9899:1999 -
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):

int n = wcslen(L"café");

(That's int n = wcslen(L"café"); for those without HTML email.) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that wide character means Unicode character, but let's just assume that it does, for the moment.)

Even assuming that you can assume that wide characters are Unicode, you have not yet assumed in what kind of UTF they are. (Don't assume I am deliberately making calembours :-) The only thing that the C(++) standards say about type wchar_t is that it is not smaller than type char, so a wide character could well be a byte, and a wide character string could well be UTF-8, or even ASCII.

So, should n equal four or five? Why not six? If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6.

The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. The answer is: int n = wcslen(L"café"); That's why you take the burden of calling the wcslen library function rather than assuming a hard-coded value such as:

int n = 4; // the length of string "café"

There is more to consider than just how and whether a text editor normalizes. Whatever the editor does, what if the *compiler* then normalizes it? The source file and the compiled object file are not necessarily in the same encoding and/or normalization. A certain compiler could accept a certain range of input encodings (maybe declared with a command-line parameter) and convert them all into a certain internal representation in the compiled object file (e.g., Unicode expressed in a particular UTF and with a particular normalization). That's why library functions such as strlen or wcslen exist.
You don't need to bother about what these functions will return in a particular compiler or environment, as long as the following code is guaranteed to work:

const wchar_t * myText = L"café";
wchar_t * myBuffer = malloc(sizeof(wchar_t) * (wcslen(myText) + 1));
if (myBuffer != NULL) {
    wcscpy(myBuffer, myText);
}

If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. Again, this is neither possible nor desirable, because a text editor is not supposed to know how the compiler (or its runtime libraries) will transform string literals.

The question I posed in the previous paragraph should ideally be obvious by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. Provided that you can define what a character is... After a few years reading this mailing list, I haven't seen a single acceptable definition of character. Moreover, I have formed the impression that it is totally irrelevant to have such a definition:

- as an end user, I am interested in a higher level kind of objects (let's call them graphemes, i.e. those things I see on the screen and can interact with using my mouse);

- as a programmer, I am interested in a lower level kind of objects (let's call them encoding units, i.e. those things that I count when I have to allocate memory for a string, or the like).

The term character is in a sort of conceptual limbo which makes it pretty useless for everybody, IMHO. _ Marco
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
I (Marco Cimarosti) wrote: So, should n equal four or five? Why not six? ^^^ Errata: seven. If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6. ^ Errata: 7. Sorry. _ Marco
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk wrote: So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input.

Standards and fantasy are both good things, provided you don't mix them up. The wcslen function has nothing whatsoever to do with the Unicode standard, but it has everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop.

(One can imagine a second parameter specifying whether NFC or NFD is required.) One can imagine whatever (s)he wants, but should please avoid claiming that his/her imagination corresponds to some existing standard.

This makes the issue one not for the text editor but for the programming language or its string handling library. This is correct.

The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible. Can you please cite the passage where the Unicode standard would not allow this? _ Marco
RE: [OT]
[...] some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!) Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-) Or simply flush Guinne$$ and drink Murphix. :-) Ciao. Marco
[OT] GB 18030 certification
I was wondering: what exactly does GB-18030 certification consist of? I guess that some tests are done on the software, but what exactly? Also, where and who performs this certification? Does the Chinese government do it directly, or is it out-sourced to external agencies? Does this have to be done in China, or are offices available abroad as well? Thanks in advance for any info. Please reply off-line if you think the matter is not of general interest for the list. _ Marco
RE: Problems encoding the spanish o
Pim Blokland wrote: Not only that, but the process making the mistake of thinking it is UTF-8 also makes the mistake of not generating an error for encountering malformed byte sequences, BTW, this process has a name: Internet Explorer. AND of outputting the result as two 16-bit numbers instead of one 21-bit number. I guess that this resulted from copying and pasting the resulting text into an editor and saving it as UTF-16. _ Marco
RE: Tamil conjunct consonants (was: Encoding Tamil SRI)
Peter Jacobi wrote: IMHO this doesn't fit actual Tamil use well and raises a lot of practical problems. Either there must be an accepted list of these ligatures (but lists of archaic usage tend to grow), or one is bound to put a preemptive ZWNJ after every SHA VIRAMA in modern use, to prevent conjunct consonant forming. If this archaic ligature problem extends to other grantha consonants, even more preemptive ZWNJs are necessary for contemporary Tamil. Archaic ligatures are supposed to be present only in a font designed for reproducing an archaic look. Those fonts should not be used for typesetting modern Tamil. There is nothing special about Tamil here: this would be true for any other script. E.g., if you typeset this English e-mail with a Fraktur OpenType font, many archaic ligatures might appear, such as ch or ss. Moreover, unexpected contextual forms could appear: e.g., the s in special could look very different from the s in ligatures (long s vs. short s). ZWNJ's etc. should be inserted only in special cases, e.g. when the presence or absence of a ligature would change the meaning of the word, or anyway affect the meaning of the text. _ Marco
RE: Encoding Tamil SRI
Peter Constable wrote: Alternatives given were (0BB8)(0BCD)(0BB1)(0BC0) (0BB6)(0BCD)(0BB1)(0BC0) (if and when U+0BB6 becomes Unicode) (0B9A)(0BBF)(0BB1)(0BC0) Alternatives to what? The first and third sequence would have distinct appearances (see attached file), and would constitute distinct spellings. The second cannot be evaluated without knowing what they intend 0BB6 to be. U+0BB6 = TAMIL LETTER SHA (see http://www.unicode.org/alloc/Pipeline.html). _ Marco
Re-distributing the files in http://www.unicode.org/Public/MAPPINGS/VENDORS
Is it allowed to re-distribute, in a commercial application, the Microsoft and Apple mapping files available from the Unicode server? I am talking about the files published in the following directories: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE The header comments of mapping files in other directories under http://www.unicode.org/Public/MAPPINGS often contain explicit instructions for re-distribution, such as: Unicode, Inc. hereby grants the right to [...] make copies of this file in any form for internal or external distribution as long as this notice remains attached; Unicode, Inc. specifically excludes the right to re-distribute this file directly to third parties or other organizations whether for profit or not. It would be nice to have similar statements also from Microsoft and Apple, either in the header of the mapping files themselves or as separate read-me files in the relevant directories. If this is not possible, could I please have a public or private reply from the contact e-mails indicated in the files ([EMAIL PROTECTED] and [EMAIL PROTECTED], respectively), or from a Unicode Consortium official, stating whether or not I am allowed to re-distribute the above described files in a commercial application? Thank you in advance. Regards. Marco Cimarosti (S3, Italy, http://www.essetre.it)
RE: GDP by language
Mark Davis wrote: Marco, I certainly wouldn't draw that conclusion. This is not the appropriate forum for a political or ethical discussion, Of course. I just noticed that those numbers reflect a sad fact of life: that rich people get more than poor people. As this fact is so obvious to anyone, I thought that my remark would not have caused a long discussion. but equating GDP with "more important" in any general sense is clearly a huge leap, and one that I certainly would not make. But there certainly is a correlation between GDP and what people can buy, including software. The goal of the chart was different. Many people mistakenly think the potential customer base of non-English-speakers is smaller than it actually is. Ah, I didn't imagine it from this point of view. For people who live in non-English-speaking countries, it is easier to remember that English is not the only language in the world. I thought the chart was intended as a rationale for prioritizing the support of languages in consideration of the profitability of the corresponding markets: 1. support for Western languages is priority one, as it corresponds to the largest slice of market; 2. CJK support comes immediately after, as it corresponds to the second largest market; 3. then comes Bidi support, which corresponds to a smaller but still interesting market; 4. Indic support can wait, as the corresponding market is less profitable. This is, IMHO, how the people paying our salaries would read the chart. I am not even blaming them, as that is probably the correct reading, from the point of view of business. _ Marco
RE: GDP by language
Mark Davis wrote: BTW, some time ago I had generated a pie chart of world GDP divided up by language. Those quotients are immoral. Of course, this immorality is not the fault of whoever did the calculation: the immorality is out there, and those infamous numbers are just an arithmetical expression of it. In practice, those quotients say that, e.g., Italian (spoken by 50 million people or less) is more important than Hindi (spoken by nearly one billion people), just because the average Italian is richer than the average Indian. In other terms, each Indian (or any other citizen of a poor country) has 1/20 or less of the linguistic rights of an Italian (or any other citizen of a rich country). BTW, by summing up languages written with the same script, it is easy to derive the immoral quotients of writing systems:

Latin       59.13%
Han         20.60%
Arabic       3.82%
Cyrillic     2.99%
Devanagari   2.54%
Hangul       1.84%
Thai         0.87%
Bengali      0.44%
Telugu       0.42%
Greek        0.40%
Tamil        0.34%
Gujarati     0.26%

_ Marco
RE: Line Separator and Paragraph Separator
Jill Ramonsky wrote: [...] I've even invented (and used) some 8-bit encodings which leave the whole of Latin-1 unchanged (apart from the C1s) and use C1 characters a bit like surrogate pairs to reach the rest. Doug, are you listening? It seems there's a new clone of UTF:-)Z waiting for implementation! _ Marco
RE: Swahili Banthu
Peter Kirk wrote: Are we talking about a real non-Latin script, some kind of syllabary or logographic script, for Swahili and other Bantu languages? [...] Or did someone not notice that Marco's comments were about the word joke? Indeed. In the last few months, I have been relatively serious, so someone may not know or remember that I am the unofficial Unicode List's clown. *|:o) _ Marco
RE: Swahili Banthu
Philippe Verdy wrote: As Africa has been influenced by many foreign invasions, there may in fact exist other scripts to represent this language [...] Yes: until a recent past, Swahili was also commonly written in the Arabic alphabet. _ Marco
RE: PUA
Chris Jacobs wrote: [...] Nevertheless I think if Unicode doesn't want to decide how the PUA is to be interpreted Please take notice of this interpreted: I'll come back to this soon. it should at the very least provide a mechanism by which a user of the PUA can specify which specification he prefers. I plan to propose such a mechanism: I want to propose a char with the following properties: Scalar Value: U+E0002 This starts a PUA interpretation Again, please take notice of this interpretation. selector tag. The content of the tag is a Font family name. For all PUA chars between this tag and the corresponding Cancel tag the copyright holder of the font is the sole authority about how the PUA should be interpreted. Again, interpreted... Any comments? Yes. A font tells me how a certain run of text should be *displayed* in rich text, not how it should be *interpreted* in plain text. Imagine that I have been asked to write a function AreTheseLetters() which gets a string argument (i.e., a piece of plain text) and returns a Boolean value indicating whether all the characters in it are letters. For non-PUA characters, I already implemented this using Unicode's General Category property: I decided that all characters whose General Category is L* are letters. My default assumption about PUA characters is that they are not letters. So far so good. Now I want to use your PUA Plan-14 tags, if present, to override the above assumption about PUA characters. E.g., imagine that my string contains this: (U+0E U+0E0002 U+0E0046 U+0E006F U+0E004F U+0E0062 U+0E0061 U+0E0072 U+0E002E U+0E0074 U+0E0074 U+0E0066 U+0E007F U+E017 U+E009) This is what I am going to do: 1) I parse the tags at the beginning of the string and save the relevant information in a temporary variable which we will call PuaInterpretation; 2) I remove the tags.
Now, my PuaInterpretation variable contains the following information: Foobar.ttf And my string contains the following text: (U+E017 U+E009) Now, what's the next step? What am I supposed to do to find out whether, according to the PUA interpretation called Foobar.ttf, U+E017 and U+E009 are letters or not? _ Marco
RE: Klingons and their allies - Beyond 17 planes
John Cowan wrote: You persist in misunderstanding. Suppose I came along and told you I wanted to create a Unicode codepoint for each word in every language on Earth. Would you blithely allocate me a 24-billion-codepoint private space? Why? 200 million should be more than enough: that's more than 30,000 words for each living language. Of course, you should only encode abstract words, such as ENGLISH VERB JOKE, and combining morphemes such as ENGLISH COMBINING INFLECTION PAST TENSE, ENGLISH COMBINING INFLECTION PRESENT PARTICIPLE, etc. It will be the task of the uttering engine to utter a sequence like ENGLISH VERB JOKE + ENGLISH COMBINING INFLECTION PRESENT PARTICIPLE with the ligature "joking". Of course, this will only happen with OpenLex-enabled uttering engines: a naive uttering engine based on old TrueLex would render it with the fallback uttering "joke -ing". To make it more interesting, you could also encode a few useless compatibility presentation inflected forms such as ENGLISH VERB SPEAK PAST TENSE FORM, which will get decomposed to ENGLISH VERB SPEAK + ENGLISH COMBINING INFLECTION PAST TENSE, and finally be rendered as "spoke", "speaked" or "speak -ed", depending on the platform. Notice that a few words will need contextual forms, such as ENGLISH INDETERMINATE ARTICLE, which will display as "a" or "an" depending on the following code point. Languages, such as Swahili, which use prefixes instead of suffixes will be encoded in logical order, i.e. with the combining prefix after the root. It will be the task of the uttering engine to reorder the prefix. E.g., the Swahili word "watu" (plural of "mtu" = man) will be encoded as SWAHILI NOUN TU + SWAHILI COMBINING INFLECTION PLURAL FOR PEOPLE and, in theory, it will be rendered as "watu". In practice, it will always be rendered as "-tu wa-" because no one will invest in implementing Swahili rendering. Ciao. Marco
RE: Canonical equivalence in rendering: mandatory or recommended?
Jill Ramonsky wrote: In my experience, there is a performance hit. I had to write an API for my employer last year to handle some aspects of Unicode. We normalised everything to NFD, not NFC (but that's easier, not harder). Nonetheless, all the string handling routines were not allowed to assume that the input was in NFD, but they had to guarantee that the output was. These routines, therefore, had to do a convert to NFD on every input, even if the input were already in NFD. This did have a significant performance hit, since we were handling (Unicode) strings throughout the app. I think that next time I write a similar API, I will deal with (string+bool) pairs, instead of plain strings, with the bool meaning already normalised. This would definitely speed things up. Of course, for any strings coming in from outside, I'd still have to assume they were not normalised, just in case. You could have split the NFD process into two separate steps: 1) decomposition per se; 2) reordering of combining classes. You could have performed step 1 (which is presumably much heavier than 2) only on strings coming from outside, and step 2 at every passage. In a further enhancement, step 2 could be called only upon operations which could produce non-canonical order: e.g. when concatenating strings but not when trimming them. To gain even more speed, you could implement an ad-hoc version of step 2 which only operates on out-of-order characters adjacent to a specified location in the string (e.g., the joining point of a concatenation operation). Just my 0.02 euros. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: --- Marco Cimarosti wrote: OK but, then, your ZWJ becomes exactly what Unicode's VIRAMA has always been: [...] You are absolutely right. I am suggesting that the language-specific viramas be retained as script-specific *explicit* viramas that never disappear. In addition, let's have a script-specific ZWJ which behaves in the way you describe in the preceding paragraph. Good, good. We are making small steps forward. What you are really asking for is that each Indic script have *two* viramas: - a soft virama, which is normally invisible and only displays visibly in special cases (no ligatures for that cluster); - a hard virama (or explicit virama, as you correctly called it), which always displays as such and never ligates with adjacent characters. Let's assume that it would be handy to assign these two viramas to different keys on the keyboard. Or, even better, let's assign the soft virama to the plain key and the hard virama to the SHIFT key, OK? To avoid misunderstandings with the term virama, let's label this key JOINER. Now, this is what you *already* have in Unicode! On our hypothetical Bangla keyboard: - the soft virama (the plain JOINER key) is Unicode's BENGALI SIGN VIRAMA; - the hard virama (the SHIFT+JOINER key) is Unicode's BENGALI SIGN VIRAMA+ZWNJ. Not only does Unicode allow all of the above, but it also has a third kind of virama, which may or may not be useful in Bangla but is certainly useful in Devanagari and Gujarati: - the half-consonant virama (let's assign it to the ALT+JOINER key in our hypothetical keyboard), which forces the preceding consonant to be displayed as a half consonant, if possible. This is Unicode's BENGALI SIGN VIRAMA+ZWJ. Notice that, once you have these three viramas on your keyboard, you don't need keys for ZWJ and ZWNJ, as their only use, in Indic, is after a xxx SIGN VIRAMA.
Apart from the fact that two of the three viramas are encoded as a *pair* of code points, how does the *current* Unicode model impede you from implementing the clean theoretical model that you have in mind? [...] - independent and dependent vowels were the same characters; [...] I agree with you on all of these issues. You have in fact summed up my critique of the ISCII/Unicode model. OK. But are you sure that this critique should necessarily be directed at the *encoding* model, rather than at some other part of the chain? I'll now try to demonstrate how the redundancy of dependent/independent vowels may also be solved at the *keyboard* level. You are certainly aware that some national keyboards have so-called dead keys. A dead key is a key which does not immediately send (a) character(s) to the application but waits for a second key; in European keyboards dead keys are used to type accented letters. E.g., let's see how accented letters are typed on the Spanish keyboard (which, BTW, is by far the best designed keyboard in Western Europe): 1. If you press the ´ key, nothing is sent to the application, but the keystroke is memorized by the keyboard driver. 2. If you now press one of the a, e, i, o, u or y keys, characters á, é, í, ó, ú or ý are sent to the application. 3. If you press the space bar, character ´ itself is sent to the application. 4. If you press any other key, e.g. m, the two characters ´ and m are sent to the application in this order. Now, in the description above substitute: - the ´ key with 0985 BENGALI LETTER A (but let's label it VIRTUAL CONSONANT); - the a ... y keys with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI VOWEL SIGN AU; - the á ... ý characters with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER AU. What you have is a Bangla keyboard where dependent vowels are typed with a single vowel keystroke, and independent vowels are typed with the sequence VIRTUAL CONSONANT+vowel. Do you prefer your cons+VIRAMA+vowel model?
Personally, I find it suboptimal, as it requires, on average, more keystrokes. However, if that's what you want, in the Spanish keyboard description above substitute: - the ´ key with the unshifted JOINER (= virama) key that we have already defined above; - the a ... y keys with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER AU; - the á ... ý characters with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI VOWEL SIGN AU. Now you have a Bangla keyboard where independent vowels are typed with a single keystroke, and dependent vowels are typed with the sequence JOINER+vowel. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: Is there any reason (apart from trying to be ISCII-conformant) why the Bangla word /ki/ what cannot be encoded as [KA][ZWJ][I]? Do we really need combining forms of vowels to encode Indian scripts? Perhaps you are right that it *would* have been a cleaner design to have only one set of vowels. But notice that [KA][ZWJ][I] is one character longer than [KA][I]. Maybe storage space is not a big problem these days, but it still makes 2 to 4 extra bytes for each consonant not followed by the inherent vowel /a/. Perhaps it *would* have been better to have only the combining vowels, and to form independent vowels with a mute consonant (actually, the independent vowel a). Also, why not use [CONS][ZWJ][CONS] instead of [CONS][VIRAMA][CONS]? One could then use [VIRAMA] only where it is explicit/visible. OK. But what happens when the font has no glyph for the ligature cons+ZWJ+cons, nor for the half consonant cons+ZWJ, nor for the subjoined consonant ZWJ+cons? As ZWJ, per se, is an invisible character, what happens is that your string displays as cons cons, which is clearly semantically incorrect. If you want the explicit virama to be visible, you need to encode it as cons+VIRAMA+cons. And this means that you (the author of the text) are forced to choose between ZWJ and VIRAMA based on the availability of glyphs in the *particular* font that you are using while typing. And this is a big no no no, because it would prevent you from changing the font without re-typing part of the text. What happens with the current Unicode scheme is that, if the font does not have a glyph for the ligature cons+VIRAMA+cons, nor for the half consonant cons+VIRAMA, nor for the subjoined consonant VIRAMA+cons, the virama is *automatically* displayed visibly, so that the semantics of the text is always safe, even if rendered with the most stupid of fonts. Surely, [A/E][ZWJ][Y][ZWJ][AA] is more natural and intuitively acceptable than any encoding in which a vowel is followed by a [VIRAMA]? Maybe.
But I see no reason why being natural or intuitive should be seen as a key feature for an encoding system. That might be the case for an encoding system designed to be used by humans, but Unicode is designed to be used by computers, so I don't see the problem. I assume that in a well designed Bengali input method, yaphala would be a key of its own, so, from the point of view of the user, it is just a character: they don't need to know that when they press that key the sequence of codes VIRAMA+YA will actually be inserted, so they won't notice the apparent nonsense of the sequence vowel+VIRAMA and, as we say in Italy, if the eye doesn't see, the heart doesn't hurt. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Peter Kirk wrote: I don't understand the specific issues here... But it does seem a rather strange design principle that we should expect a text to be displayed meaningfully even when the font lacks the glyphs required for proper display. The fact is that these glyphs are not necessarily *required*. Each Indic script has a relatively small set of glyphs that are absolutely required in any font, but also an unspecified number of ligatures that may or may not be present. This may depend on the language (e.g., Devanagari for Sanskrit typically uses more ligatures than Devanagari for Hindi), or even simply be a matter of typographical style. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: I am no programmer, but surely the rendering engine could be tweaked to display a halant/hashant in the aforementioned situations? I understand that it won't happen *automatically* if we were to use ZWJ instead of VIRAMA. But if you were to take the trouble to do the tweaking, you'd then have completely *intuitive* encodings for vowel yaphala sequences, vowel+ZWJ+Y, instead of oddities like vowel+VIRAMA+Y. OK but, then, your ZWJ becomes exactly what Unicode's VIRAMA has always been: a character that is normally invisible, because it merges into a ligature with adjacent characters, but occasionally becomes visible when a font does not have a glyph for that combination. But there is one detail which makes your approach much more complicated: what we have been calling VIRAMA is *not* a single character. Every Indic script has its own: DEVANAGARI SIGN VIRAMA, BENGALI SIGN VIRAMA, and so on. Each one of these characters, when displayed visibly, has a distinct glyph: a Bangla hashant is a small / under the letter, a Tamil virama is a dot over the letter, etc. With your approach, the single character ZWJ is overloaded with a dozen different glyphs depending on which script the adjacent letters belong to. Plus, it still has to be invisible when used in a non-Indic script, such as Arabic. Implementing all this is certainly possible, but would result in bigger look-up tables, for no advantage at all. Perhaps there isn't a *problem* as such, and perhaps naturalness and intuitive acceptability aren't *key* features of the system, but surely, other factors being equal, they ought to be taken into consideration in choosing one method of encoding over another? Yes. But the flaws that I see in the ISCII/Unicode model are much smaller than you imply.
E.g., I agree that it would have been more logical if: - independent and dependent vowels were the same characters; - each script were encoded in its natural alphabetical order; - there were no precomposed and decomposed alternatives for the same graphemes. And others, on which perhaps a linguist won't agree, but which would have made life much easier for programmers: - all vowels were encoded in visual order, so that no vowel reordering would be necessary; - repha ra were encoded as a separate character, so that no reordering at all would be necessary. But, all summed up, living with these little flaws is *much* simpler than trying to change the rules of a standard a dozen years after people started implementing it. _ Marco
Bogus UTF's are back! :-) (was RE: Non-ascii string processing?)
Doug Ewell wrote: [...] we'd all use UTF-336. Er? If only I had a bit more spare time, Jill. You do NOT want to get me started... :-) Go for it, Doug! :-) If only I had a bit of spare time myself, I'd be eager to run bits-per-character statistics for UTF:-)336 in various languages... Something makes me think that it would rank even worse than UTF-32 and UTF:-)64. BTW, how about the convention of using the UTF:-) prefix for our bogus/model/parody UTF's? _ Marco
RE: Unicode Public Review Issues update
Jony Rosenne wrote: I don't remember whether Hebrew Braille is written RTL or LTR. Braille is always LTR, even for Hebrew and Arabic. To be more precise, Braille is always LTR when you read it, but RTL when you write it manually (because it is engraved on the back side of the paper, using a dotted rule and a stylus). _ Marco
Braille is not bidi neutral! (was RE: Unicode Public Review Issues update)
I (Marco Cimarosti) wrote: Jony Rosenne wrote: I don't remember whether Hebrew Braille is written RTL or LTR. Braille is always LTR, even for Hebrew and Arabic. Hwæt! I noticed only now that the Bidirectional Category of braille characters is ON - Other neutrals! AFAIK, that is completely broken! A run of braille text *must* remain LTR in a bidi context, because braille is never RTL. This ON bidi category makes it impossible to correctly encode, e.g., a manual of Braille in Hebrew or Arabic, because the braille runs would get swapped by the Bidirectional Algorithm. _ Marco
RE: What things are called (was Non-ascii string processing)
Jill Ramonsky wrote: Hey - the public will just have to get used to it! No, the public should not be bothered with these technical details: in the user manual, a book will still be a book. The fact that, in the source code of the application, book means something else is of interest only to programmers. _ Marco
RE: Non-ascii string processing?
Peter Kirk wrote:

For i% = 1 To Len(utf8string$)
    c$ = Mid(utf8string$, i%, 1)
    Process c$
Next i%

Such a loop would be more efficient in UTF-32 of course, but this is still a real need for working with character counts. If the string type and functions of this Basic dialect are not Unicode-aware, then: - Len(s$) returns the number of *bytes* in the string; - Mid(s$, i%, 1) returns a single *byte*; - your Process() subroutine won't work... If the string type and functions are Unicode-aware (as, e.g., in Visual Basic or VBScript), then I'd expect that the actual internal representation is hidden from the programmer, hence it makes no sense to talk about a UTF-8 string. _ Marco
RE: Non-ascii string processing?
Elliotte Rusty Harold wrote: A W3C XML Schema Language validator needs a character based API to correctly implement the minLength and maxLength facets on xsd:string As far as I understand, xsd:string is a list of Character-s, and a Character is an integer which can hold any valid Unicode code point. In other terms, xsd:string is necessarily in UTF-32 (or something close to it): it cannot be in UTF-8 or UTF-16. The numbers returned by length, minLength and maxLength are the actual, minimum and maximum number of *list elements* contained in the list. I.e., in the case of xsd:string, the *size* of the string in *encoding units*. The fact that, in UTF-32, the *size* of the string in encoding units corresponds to the number of characters is coincidental. In any case, the useful information is always the *size* of the string in encoding units (octets for UTF-8, 16-bit units for UTF-16, etc.), not the number of characters it contains. _ Marco
RE: Non-ascii string processing?
Doug Ewell wrote: Depends on what processing you are talking about. Just to cite the most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented strlen() will fail dramatically. Why? The purpose of strlen() is counting the number of *bytes* needed to store a certain string, and this works just as well for UTF-8 as it does for SBCS's or DBCS's. What strlen() cannot do is counting the number of *characters* in a string. But who cares? I can imagine very few situations where such information would be useful. _ Marco
RE: Non-ascii string processing?
Theodore H. Smith wrote: Hi lists, Hi, member. I'm wondering how people tend to do their non-ascii string processing. I think no one has been doing ASCII string processing for decades. :-) But I guess you meant non-SBCS (single byte character set) string processing. [...] So, I'm wondering, in fact, is there ANY code that needs explicit UTF8 processing? Here's a few I've thought of. In general, you need UTF-32 whenever you need to access the single *characters* in a string. This is needed for all kinds of lexical or typographic processing, e.g.: - case matching or conversion (a vs. A); - loose matching (á vs. a); - displaying the text; Can anyone tell me any more? Please feel free to go into great detail in your answers. The more detail the better. There is at least one case in which you need UTF-8-aware code even if not accessing single characters: it is when you *trim* a string at an arbitrary byte position. E.g.: char str1 [9] = "abc"; char * str2 = "αβγ"; strncat(str1, str2, sizeof(str1)); If strncat() is UTF-8 aware, str1 will be "abcαβ" + null terminator (8 bytes). But if strncat() is *not* UTF-8 aware, str1 will contain an invalid UTF-8 string: "abcαβ" + an *illegal* byte (0xCE) + null terminator. _ Marco
RE: Non-ascii string processing?
Stephane Bortzmeyer wrote: On Mon, Oct 06, 2003 at 12:09:34PM +0200, Marco Cimarosti [EMAIL PROTECTED] wrote a message of 14 lines which said: What strlen() cannot do is counting the number of *characters* in a string. But who cares? I can imagine very few situations where such information would be useful. It is one thing to explain that strlen() has byte semantics and not character semantics. It is another to assume that character semantics are useless. I never said that character semantics are useless: I said that it is almost always useless to count the *number* of Unicode characters in a string. One of the few cases in which such a count could be useful is to pre-allocate a buffer for an UTF-8 to UTF-32 conversion. But there is no need of a general-purpose API function for such a special need. Most text-processing software allows you to count the number of characters in a document, for instance. Yes. And: 1) That is a very special need of a very special kind of application (a word processor), so it doesn't justify a general-purpose API function: people don't normally write word processors every day. 2) That count cannot be done by counting Unicode characters (i.e., encoding units): you have to count the objects that the user perceives as typographical characters. E.g., control or formatting characters should be ignored, sequences of two or more space characters should be counted as one, and a word like élite is always counted as five characters, regardless of the fact that it might be encoded as six Unicode characters. In an Indic or Korean text, each syllable counts as a single character, although it may be encoded as a long sequence of Unicode characters. 3) That is a very silly count anyway. If you want to have an idea of the size of a document, lines or words are much more useful units. Any decent Unicode programming environment should give you two functions, one for byte semantics and one for character semantics. Both are useful. OK. 
But the length in characters of a string is not character semantics: it's plain nonsense, IMHO. _ Marco
RE: Non-ascii string processing?
Stephane Bortzmeyer wrote: OK. But the length in characters of a string is not character semantics: it's plain nonsense, IMHO. I disagree. Feel free. But I still don't see any use in knowing how many characters are in an UTF-8 string, apart from the use that I already mentioned: allocating a buffer for a UTF-8 to UTF-32 conversion. _ Marco
RE: Non-ascii string processing?
Edward H. Trager wrote: But I still don't see any use in knowing how many characters are in an UTF-8 string, apart from the use that I already mentioned: allocating a buffer for a UTF-8 to UTF-32 conversion. Well, I know a good use for it: a console or terminal-based application which displays information using fixed-width fonts in a tabular form, such as a subset of records from a database table. To calculate how wide to display each column, knowing the maximum number of characters in the strings for each column is a useful starting place. Well, I am just about to start a time-consuming task: fixing an application which was based on the assumption that the number of characters in a string was a good starting place to format tabular text in a fixed-width font... You have already explained why this can't work when CJK or other scripts pop in. What you really need for such a thing is a function which computes the width of a string in terms of display units, rather than its length in terms of characters. _ Marco
RE: FW: Web Form: Other Question: British pound sign - U+00A3
This (Peter's) answer is, in my understanding, the nearest to the truth. He made the same assumption I did: you declared that your file was UTF-8 but actually it wasn't. :-) Here is the problem: How do I make my keyboard which only produces 8-bit [...] The keyboard has nothing to do with it. The problem is how you save the file. You should see if the Save as... command of your text editor (or HTML authoring tool) has an option like Save as UTF-8. If it doesn't, see if your Notepad utility has it (I don't remember in which version of Windows it was added). If it's there, you just open your file and save it selecting Save as UTF-8. You can also use a utility to convert the character set. E.g., try the iconv.exe utility from libiconv (http://sourceforge.net/project/showfiles.php?group_id=23617). Ciao. Marco
RE: Web Form: Other Question: British pound sign - U+00A3
[EMAIL PROTECTED] wrote (through Magda Danish): [...] Our problem is the representation of the £ sign (British pound sign - U+00A3). When we type this character into our pages and then set the character encoding in our pages to Unicode (UTF-8) (either by setting it directly in the HTTP header, or setting it using the <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> I think it should be charset=UTF-8, in capital letters. I was looking into the IANA charsets today, and I don't remember having seen a lowercase alias for that. tag), when we view the pages we see the standard ASCII set of characters, but the Pound sign displays as an error. The most obvious question is: are your pages *actually* in UTF-8? It is not enough that you *declare* that they are UTF-8 if you didn't actually save them as UTF-8 with your editor. Could you put on line a small test page containing the pound symbol and post the URL? Also which version of Unicode does HTML 4.0 support using escape characters (e.g. &#163;)? It doesn't matter which version of Unicode it is, because the pound symbol has been in Unicode from day zero. Notice however that an HTML character reference must end with a semicolon: &#163; _ Marco
RE: Internal Representation of Unicode
[EMAIL PROTECTED] wrote: In a plain text environment, there is often a need to encode more than just the plain character. A console, or terminal emulator, is such an environment. Therefore I propose the following as a technical report for internal encoding of unicode characters; with one goal in mind: character equalence is binary equalence. I guess you meant equivalence. Q1: But what are character equivalence and binary equivalence, and why did you choose them as your goals? I thought of dividing the 64 bit code space into 32 variably wide plains, Q2: What are these plains for? Why are there 32 of them? one for control characters, one for latin characters, one for han characters, Q3: Why do you want to treat Latin characters and Han characters differently? There is nothing special with Latin or Han characters in Unicode: they are just 2 of the about 50 scripts currently supported in Unicode. (see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/Scripts.txt) Q4: And how do you plan to distinguish them? Both Latin and Han characters are scattered all over the Unicode space, so you need to check many ranges to determine which character belongs to which category. Q5: And what about all characters which are neither Latin nor Han? and so on; using 5 bits and the next 3 fixed to zero (for future expansion and alignment to an octet). I call plain 0 control characters and won't discuss it further. Q6: Why do control characters have a special handling? Q7: Don't control characters have properties attached like any other characters? One example of a property which could be useful to attach to control characters is directionality. E.g., a TAB is always a TAB but, after it has passed through the Bidirectional Algorithm, its directionality can be resolved to be either LTR or RTL. 
Plain 1, I had intended for latin characters with the following encoding method in mind:

bits  63..59   58..56   55..40   39..32   31..24   23..16   15..8   7..0
     +-------+--------+--------+--------+--------+--------+-------+------+
     | plain |  zero  |  attr  |  res   |  uacc  |  lacc  |  res  | char |
     +-------+--------+--------+--------+--------+--------+-------+------+

* Plain  Plain (5 bits)
* Zero   Zero bits (3 bits)
* Attr   Attributes (16 bits)

Q8: What kind of information are these three fields for? Q9: In case your answer to Q8 is they are application-defined, then what is the rationale for defining and naming more than one field? I mean: if they are application-defined, why not leave the task of defining sub-fields to the application?

* Res    Reserved (8 bits)
* Uacc   Upper Accent (8 bits)
* Lacc   Lower Accent (8 bits)

Q10: Why do you treat accents specially? They are just characters like any others. In Unicode there is no special limitation as to how many accents can be applied to a base character. There is also no obligation for accents to have a base character.

* Res    Reserved (8 bits)
* Char   Character (8 bits)

Q11: How can you store a Latin character in 8 bits? Unicode has 938 Latin characters, and their codes range from U+0041 to U+FF5A. All of these fields are actually implementation defined, with just one rule for char: don't include characters that can be made with combinations, that's what the accent fields are for. But characters are not necessarily decomposed into one Latin character with one upper accent and one lower accent. E.g., U+01D5 (LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304 (LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both COMBINING DIAERESIS and COMBINING MACRON are upper accents. Q12: How are you going to deal with a combination of, e.g., a base letter + 5 upper accents + 3 lower accents? This allows for 255 upper and lower accents which should be enough -- for now. I counted 129 upper accents. But their codes range from U+0300 to U+1D1AD. Q13: How are you going to compress these codes into 8 bits? 
Are you planning to use a conversion table from the Unicode code to your internal 8-bit code? For Han characters I thought of the following encoding method (with no particular plain in mind):

bits  63..59   58..56   55..40   39..32   31..0
     +-------+--------+--------+--------+------+
     | plain |  zero  |  attr  | style  | char |
     +-------+--------+--------+--------+------+

* Plain  Plain (5 bits)
* Zero   Zero bits (3 bits)
* Attr   Attributes (16 bits)
* Style  Stylistic Variation (8 bits)

Q14: What kind of information is in field Style? Q15: Why do only Han characters have this? Letters in many other scripts may have stylistic variations. E.g.,
RE: About that alphabetician...
Michael Everson wrote: At 08:33 -0700 2003-09-25, John Hudson wrote: Unicode is an encoding standard for text on computers that allows documents in any script and language to be entered, stored, edited and exchanged. blank stare from layman Unicode is a code in which every letter of every alphabet in the world corresponds to a number. This numeric code is used to write text inside computers, because only numbers can be written inside computers. When the computer shows on the screen the text which it has inside, it draws the letters corresponding to the Unicode numbers which it has inside. My 4-year-old listened to this explanation and said everything was clear. The only problem is that he now wants to disassemble my computer to see the numbers it has inside. He thinks that the numbers are stored in the form of talking ladybugs which say the number out loud when you tap on them (he got this idea from one of his favorite books: Learn the Numbers with the Talking Ladybugs). _ Marco
RE: [OT?] QBCS
Doug Ewell wrote: [...] (BTW, pet peeve: The word acronym should only be used to mean a pronounceable WORD (nym) formed from the initials of other words. Classic examples are scuba and radar. If you can figure out how to pronounce qbcs, more power to you, but to me it's just an abbreviation.) Right, sorry. (I can pronounce ['kubks], although I wouldn't do it in front of my managers and customers. :-) Actually, I don't like this QBCS term and I'd rather avoid saying it myself. But I wanted to be sure what other people mean when they use it. [...] So what it really means must be quadra-byte character encoding, and both GB 18030 and UTF-32 should fit into that category. GB 18030, yes, because its code units vary from one to four bytes in length. UTF-32, no, because its code units are uniformly 32 bits. But UTF-8 fits the definition. _ Marco
[OT?] QBCS
It seems that the IT world has a new acronym: QBCS. I understand that it stands for quadra-byte character set, and I heard it used to refer to GB 18030. My question is: is it just a fancy synonym for GB 18030 or can it also refer to Unicode or other encodings? Thanks in advance. _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
I posted my feedback through the report forms. The text of the two posts is attached. (I considerably shortened the list of non-Latin punctuation marks that I suggest excluding from identifiers, although I added two of the Hebrew punctuation marks suggested by Peter Kirk.) _ Marco

Feedback on UTR#31 (draft 1): Full/Half-Width Characters. I suggest that all compatibility characters which are labelled wide, narrow or small and whose compatibility decomposition is already in class Pattern_Syntax be added to class Pattern_Syntax as well. In practice, I am suggesting to add the following lines to section 4.1 Proposed Pattern Properties:

FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE

Rationale. These characters are almost identical, visually and semantically, to their normal-width counterparts. Allowing such characters in identifiers means allowing identifiers which look identical to expressions of a totally different kind. 
E.g., an identifier such as foo，bar (where ， is U+FF0C FULLWIDTH COMMA) would look identical to the expression foo, bar (identifier foo + comma + space + identifier bar). Regards. Marco Cimarosti ([EMAIL PROTECTED])

Feedback on UTR#31 (draft 1): Non-Latin Punctuation. I suggest that a small set of non-Latin punctuation marks be added to class Pattern_Syntax. Each one of the punctuation marks that I am suggesting to include complies with the following conditions: 1) It is very similar in shape to an ASCII-range character which is already in class Pattern_Syntax; 2) It is very similar in function to an ASCII-range character which is already in class Pattern_Syntax; 3) It is used in the modern orthography of modern languages and/or it is commonly available on national keyboards; 4) It is not commonly used to form words or phrases which may be used as identifiers. In practice, I am suggesting to add the following lines to section 4.1 Proposed Pattern Properties:

037E ; Pattern_Syntax # GREEK QUESTION MARK
0387 ; Pattern_Syntax # GREEK ANO TELEIA
055C..055E ; Pattern_Syntax # ARMENIAN EXCLAMATION MARK..ARMENIAN QUESTION MARK
0589 ; Pattern_Syntax # ARMENIAN FULL STOP
05C0 ; Pattern_Syntax # HEBREW PUNCTUATION PASEQ
05C3 ; Pattern_Syntax # HEBREW PUNCTUATION SOF PASUQ
060C..060D ; Pattern_Syntax # ARABIC COMMA..ARABIC DATE SEPARATOR
061B ; Pattern_Syntax # ARABIC SEMICOLON
061F ; Pattern_Syntax # ARABIC QUESTION MARK
066A..066C ; Pattern_Syntax # ARABIC PERCENT SIGN..ARABIC THOUSANDS SEPARATOR
06D4 ; Pattern_Syntax # ARABIC FULL STOP
066D ; Pattern_Syntax # ARABIC FIVE POINTED STAR
0964..0965 ; Pattern_Syntax # DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
10FB ; Pattern_Syntax # GEORGIAN PARAGRAPH SEPARATOR
1362..1368 ; Pattern_Syntax # ETHIOPIC FULL STOP..ETHIOPIC PARAGRAPH SEPARATOR

Rationale. Punctuation marks complying with conditions #1 to #3 may easily be confused with ASCII-range characters which are normally used in the syntax of computer languages and notations. 
Allowing such characters in identifiers would mean allowing identifiers which look almost identical to expressions of a totally different kind. E.g., an identifier such as return; (where ; is U+037E GREEK QUESTION MARK) looks identical to the expression return; (identifier or keyword return + semicolon). However, punctuation marks mentioned in condition #4 (e.g. syllable separators, morpheme separators, abbreviation marks, diacritic marks, apostrophes) are excluded from my suggestion (i.e. I suggest allowing them in identifiers) because they are useful to form words or phrases which may act as identifiers. Character-by-character rationale. In the following list, I listed each suggested character along with the ASCII-range character which looks similar to it (as per condition #1 above) and with the ASCII-range character
[OT?] ICU training offerings anyone?
Dear Unicoders, Does any company offer training on ICU programming? I am more interested in courses located in Europe, but I'd also be glad to know about courses in North America or elsewhere. If you feel that this information is not appropriate for the public list, please feel free to reply privately. Thank you in advance. Regards. _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Peter Kirk wrote: Similarly, Hebrew geresh and gershayim look like quotation marks and are used interchangeably in legacy encodings, the same with maqaf and hyphen - maqaf is very much the cultural equivalent of hyphen, and I have seen recent discussion about whether the hyphen key on a Hebrew keyboard ought actually to generate a maqaf. No, wait. The fact that maqaf is the cultural (and visual) equivalent of a hyphen is a good reason to *exclude* it from class Pattern_Syntax, i.e. *allow* it in identifiers, so that composite words can be used as identifiers. As an ordinary Latin hyphen is already in the list, by your argument there is no reason to exclude other things that look like it and function like it. I guess that the only reason why the ASCII '-' is included in Pattern_Syntax is that it is also used as minus. If it only had the meaning hyphen, it would not be in Pattern_Syntax. _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Peter Kirk wrote: Well, the situation with Hebrew sof pasuq is almost identical to that for Greek and Arabic question marks, except that it is functionally a full stop not a question mark, so I can't see any reason other than prejudice for omitting it from the list. Well, I had a much better reason than prejudice: ignorance. :-) That's why I said that my two lists were tentative and asked for comments. [...] - maqaf is very much the cultural equivalent of hyphen, and I have seen recent discussion about whether the hyphen key on a Hebrew keyboard ought actually to generate a maqaf. [...] I'm not talking about biblical Hebrew here, I'm talking about a living modern language. [...] And these are exactly the kinds of reasons that I was looking for: punctuation marks used in modern text and/or available on keyboards are good candidates for inclusion in Pattern_Syntax (i.e., exclusion from usage in identifiers). [...] extend the list to include all punctuation and to allow as yet undefined characters to be added to it. Well, the requirement for an invariable set seems to be part of the rules of the game with this UTR, so I'll stick to it. I guess that this requirement is due to backward compatibility issues. If version X of a certain programming language accepts identifier foo:bar (where : is a certain mark), it is not acceptable that version X+1 of the same language treats the same sequence as a syntax error: that would make existing source code in that language potentially unusable. Obviously, it must be possible to customize the default sets: most *existing* computer languages already allow in identifiers characters that would be excluded by UTR#31 (e.g.: _ - $ '). _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Peter Kirk wrote: But the other way round is less of a problem. So I am suggesting that for now we define all punctuation characters except for those with specifically defined operator functions, also all undefined characters, as giving a syntax error. This makes it possible to define additional punctuation characters, even those in so far undefined scripts like Tifinagh, as valid operators in future versions. Yes, but this makes it impossible to use any as-yet undefined scripts in identifiers! E.g., you'd never be able to write a variable name in Tifinagh letters in future versions! Unless you are still thinking of non-fixed sets, in which case I must remind you again that there are no balls or door-keepers in a card game... :-) Ciao. Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Rick McGowan wrote: the process as possible so that it can be considered The draft is found at http://www.unicode.org/reports/tr31/ and feedback can be submitted as described there. (Before submitting official feedback, I'd like to discuss my comments here. BTW, which Type of Message should I use in the feedback form? Is it OK to use Technical Report or Tech Note issues?) My two cents are both about adding characters to the Pattern_Syntax of 4.1 Proposed Pattern Properties. IMHO: 1. Full-width, half-width, and small punctuation characters should be in class Pattern_Syntax, like their normal-width counterparts. 2. Non-Latin punctuation characters should be in class Pattern_Syntax, like their Latin counterparts. The rationale for suggestion 1 is that wide, narrow and small compatibility characters are substantially identical (in appearance and function) to their normal-width counterparts. A parser allowing an unquoted full-width punctuation character in an identifier is guaranteed to cause confusion to the user. E.g., consider the following expression: foo，bar To me, it *definitely* looks like two identifiers separated by a comma, and I expect my parser to agree with me on this, even if the comma is actually a full-width comma. I am not saying that the parser must necessarily accept a full-width comma in that position: it is perfectly OK if the above expression causes a syntax error such as: Illegal character U+FF0C (FULLWIDTH COMMA) after identifier 'foo'. But what the parser should absolutely *not* do, IMHO, is handle foo，bar as a *single* identifier! Doing such a thing is guaranteed to cause troubles to me. E.g., I might receive a puzzling error message saying: Parameter missing: this statement requires 2 parameters, while I can *see* that there *are* two parameters: foo and bar... The rationale for suggestion 2 is very similar. 
E.g., the following expression looks like a perfectly legal C++ or Java statement: return; If the compiler tells me: Undeclared identifier, I may go crazy for the whole day trying to figure out what's going on... But if it tells me Illegal character U+037E (GREEK QUESTION MARK) after keyword return, then I immediately understand that something is wrong with that semicolon. The reason I keep suggestions 1 and 2 separate is that, in the case of wide, narrow and small compatibility characters, it is trivial to determine the corresponding regular character, while in the case of non-Latin punctuation there is room for discussing which punctuation characters are similar enough (in function or appearance) to which Latin punctuation character. For full-width, half-width, and small punctuation characters, my suggestion is to add the following lines to 4.1 Proposed Pattern Properties:

FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE

For non-Latin punctuation characters, this is my tentative list of characters that may cause trouble if used in identifiers, and which, consequently, should be 
added to class Pattern_Syntax:

037E GREEK QUESTION MARK
0387 GREEK ANO TELEIA
055C ARMENIAN EXCLAMATION MARK
055D ARMENIAN COMMA
055E ARMENIAN QUESTION MARK
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
060D ARABIC DATE SEPARATOR
061B ARABIC SEMICOLON
061F ARABIC QUESTION MARK
066A ARABIC PERCENT SIGN
066B ARABIC DECIMAL SEPARATOR
066C ARABIC THOUSANDS SEPARATOR
06D4 ARABIC FULL STOP
0964 DEVANAGARI DANDA
0965 DEVANAGARI DOUBLE DANDA
10FB GEORGIAN PARAGRAPH SEPARATOR
1362 ETHIOPIC FULL STOP
1363 ETHIOPIC COMMA
1364 ETHIOPIC SEMICOLON
1365 ETHIOPIC COLON
1366 ETHIOPIC PREFACE COLON
1367 ETHIOPIC QUESTION MARK
1368 ETHIOPIC PARAGRAPH SEPARATOR
166E CANADIAN SYLLABICS FULL STOP
1802 MONGOLIAN COMMA
1803 MONGOLIAN FULL STOP
1804 MONGOLIAN COLON
1808
RE: Proposed Draft UTR #31 - Syntax Characters
Jill Ramonsky wrote: Damn. I guess you guys are all going to hate me for asking this, but ... what exactly is a mathematical space? A compatibility space character used only in typesetting mathematics: 205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;; PS. I'm going to pre-empt any puns on other interpretations of the phrase mathematical space, [...] That's unfair! I had such a nice story about a mathematician who goes into a space bar and asks for 568.31 ml of... :-) _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Mark Davis wrote: Technical Report issues would be fine. I think #1 is worth considering. For #2, see other message to Peter Kirk. I agree with your statement: The purpose of the Pattern Syntax characters is *not* to list everything that is a symbol or punctuation mark. But that is what Peter Kirk suggested, not what I proposed. I proposed to exclude a *limited* set of script-specific punctuation that *might* be confused with punctuation characters normally used in the syntax of computer languages, either because they look identical, or because they are perceived culturally as another form of the same character. E.g., I kept out of the list everything belonging to ancient scripts (who's going to write programs in Linear B!?) and anything that I suspected would be valid inside a word or expression: hyphens, emphasis markers, ellipsis marks, etc. You said that the list of ranges must be invariable, but I doubt that we will see many new *modern* and *commonly* used punctuation marks in future versions, so I think that this requirement for invariability can reasonably be maintained. I already made the example of the Greek question mark which may be mistaken for a semicolon. That is *not* an unlikely situation: if a Greek programmer has his keyboard in Greek mode (because he just finished typing an identifier containing Greek letters) he may well forget to turn it to Latin mode before typing the trailing semicolon. Similarly, due to the fact that some punctuation characters (parentheses, etc.) are mirrored in a RTL context, an Arab programmer may think that ؟ is just an alternate RTL glyph for ?, so he may be puzzled by apparently absurd error messages. E.g., he types: foo?bar And the system calls routine foo, passing variable bar to it. (? is the call operator of this hypothetical programming language.) So, he switches to Arabic mode and types the same expression in Arabic letters. But the system says: Undeclared identifier. 
But he is *sure* that he did declare that routine and that variable, so what's going on? If the system said instead: Character '؟' is not a legal operator, everything would be much clearer. _ Marco
RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
Peter Kirk wrote: [...] I guess English legs tended to be longer than Roman ones. Well, if by English you mean those Germanic barbarians who invaded Britannia, I guess that the British mile existed way before they set foot on the island... _ Marco
RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
Doug Ewell wrote: Shouldn't a pint of beer be administratively fixed at 500 mL, just as a fifth of liquor in America is now officially 750 mL? Seems like a good task for an ISO working group. You could generalize it a bit: Alignment Of Metric And Imperial Units Whose Difference Is So Small As To Be Pointless. E.g., I never understood why on earth metres and yards should be kept different. In a public park somewhere in UK or Ireland I have seen the following sign: TOILETS --- 50 yds (45.72 m) It must be a really urgent need if one cares about those 3.28 metres... _ Marco
RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
Pim Blokland wrote: It must be a really urgent need if one cares about those 3.28 metres... 4.28 actually. Ooops. But are you serious about lengthening the yard to be the same size as the meter? I was just joking... Ha! Fat chance! You might as well suggest we abolish the yard altogether! ... What I really meant was this, in fact. Everybody understands that 50 yards on a sign for a toilet, or 1000 yards on a sign for a filling station, are just rough approximations (especially since they cannot know in advance which closet or hose I will use). They just mean "The toilet is quite close; resist!" and "Start slowing down and prepare to turn." It would be just as fine to write 50 m or 1000 m. Of course, this is if you *want* to abolish the imperial system and adopt the metre; but this *is* what the UK and Ireland have decided to do. _ Marco
RE: Handwritten EURO sign (off topic?)
Anto'nio Martins-Tuva'lkin wrote: On 2003.08.06, 11:12, Philippe Verdy [EMAIL PROTECTED] wrote: the placement of the currency unit symbol or multiple is language dependent, and the same local practices are used with the euro as were used for pre-euro currencies. You mean that Dutch should write one euro as 1,- EUR, while Portuguese as 1EUR00, and perhaps British as EUR 1.00?... It may be the case, but I'd find that a bad idea and worth fighting against. Why? Different countries have always used different characters as decimal or grouping separators for numbers. The Italian for one and a half euros is uno virgola cinquanta euro (where virgola means comma). Should we say comma and write a dot!? After all the euro is a common currency and its figures should be written in a common way. Why? In fact, the position of the currency unit and decimal separator, or the placement of the negative sign, depends mostly on the current locale (language/region) and not on the indicated currency, so this convention is applied locally for *all* currency units. Nope, this is not true. In most cases, it is: amounts in foreign currency are normally formatted according to local conventions. E.g. a price in US$ in an Italian magazine would probably be formatted as $2.345,50, not $2,345.50 or 2,345$50¢. Using the cent sign is mostly US-specific and the symbol is not recognized as such in most European countries, so the cent sign is bound directly to the dollar. [...] then I suppose there is a theoretical possibility that it may be used as a symbol for the euro cent (though I personally prefer cEUR). The problem is not *which* symbol to use for the cent: it is the very concept of a symbol for cents that is not familiar in most EU countries. I guess that Ireland is the only euro-zone country where you can see a price expressed in cents, such as 55 cents. In most other countries of Europe, the same amount would be expressed as 0.55 euros. _ Marco
RE: Arabic script web site hosting solution for all platforms
Philippe Verdy wrote: Excessive cross-posting to multiple newsgroups, forums and list servers is considered bulk (and also opposed to netiquette). As this message targets too large an audience, is off topic, and is also a commercial ad, I can say that bulk + unsolicited makes it fully qualifiable as SPAM. IMHO, Lateef Sagar's message was perfectly legitimate and absolutely ON TOPIC for the Unicode List. He didn't try to sell anything but simply pointed us to a demo page of his (questionable) technology. Over the years, I have seen lots of people come here to announce new releases of their Unicode-related products, and no one called them spammers. This is particularly true because this message calls for no other response than visiting a third-party web site. This is not discussion, but a proposal of service. Although no comments or criticism were explicitly requested, the context (the Unicode List and the other forums related to encoding or fonts) and the demonstrative nature of the page pointed to may be considered an implicit request for technical comments and discussion. I will complain to Yahoo for your abuse of its AUP in this webmail. Sorry, but I have to complain to Sarasvati about your threatening manners. _ Marco
RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
Rob Mount wrote: Q1: Can a character be both alphabetic and diacritic? I would say yes. My understanding of the Lm general category is: a diacritic letter. Q2: Is there a definitive answer as to whether this is an alphabetic character? Strictly speaking, as katakana and hiragana are not alphabets, their letters cannot be called alphabetic. But I guess that you mean alphabetic in the sense that isalpha() should return TRUE for it, i.e. in the sense that it is a character used to write *words* in the orthography of some language. In this sense, yes, IMHO: isalpha() should return true for all the characters having a general category of L (Lu, Ll, Lt, Lm, Lo). _ Marco
RE: Tamazight/berber language : How to send mail, write word documents ....
Chris Jacobs wrote: Depends on how much text you need. If it is just a few words then getting UniPad from http://www.unipad.org/ would be enough. You can copy and paste the chars from it. If this is not enough then have a look at http://www.tavultesoft.com/keyman/ BTW, UniPad also has a built-in keyboard editor. But, of course, that would only be usable within UniPad. _ Marco
RE: Tamazight/berber language : How to send mail, write word documents ....
Philippe Verdy wrote: However the interesting part of your question for discussion on this list is: - Which Unicode character should be used to encode the spacing ring? (it may conflict with the degree sign, or a superscript small letter O) - Should you use a Greek gamma or a Latin gamma, and a Greek epsilon? Of course, one should avoid mixing up different scripts in the same orthography. So, I'd suggest: - abbud (epsilon): U+0190 (LATIN CAPITAL LETTER OPEN E), U+025B (LATIN SMALL LETTER OPEN E) - arum (gamma): U+0194 (LATIN CAPITAL LETTER GAMMA), U+0263 (LATIN SMALL LETTER GAMMA) Your table also displays only the lowercase letters. It would be interesting to show the associated uppercase letters. The only dubious case in the table is uppercase arum, which could have the Greek shape. But it's unlikely. _ Marco
PUA usage (was RE: Announcement: New Unicode Savvy Logo)
[OOOPS! This works better if I set the proper MIME encoding... Sorry] Philippe Verdy wrote: This contrasts a lot with the Unicode codepoints assigned to abstract characters, that are processable out of any contextual stylesheet, font or markup system, where its only semantic is in that case private use with no linguistic semantic and no abstract character evidence, and all with the same default character properties (including shamely the bidi properties needed to render and layout the fonted text, In HTML, the default directionality of characters can be overridden with the BDO tag. E.g.: <BDO dir="rtl">hi!</BDO> This should display as an RTL string, with ! on the left side and h on the right side. The same can be achieved also in plain-text Unicode, using RLO, LRO and PDF: hi! (U+202E U+0068 U+0069 U+0021 U+202C) Ciao. Marco
RE: The role of country codes/Not snazzy
Brian Doyle wrote: on 5/29/03 9:15 AM, Marion Gunn at [EMAIL PROTECTED] wrote: When a reference to using embryonic ISO 639-3 to 'legitimize' SIL's flawed Ethnologue is let pass with no comment Why is Ethnologue flawed? And how is this more on-topic on a mailing list called Unicode List than discussing a banner offered by the _Unicode_ Consortium to flag web pages encoded in _Unicode_? _ Marco
RE: Not snazzy (was: New Unicode Savvy Logo)
Philippe Verdy wrote: Savvy is better understood in this context as aware, than archaic or informal in your English-Italian dictionary. No, archaic, American and informal are usage labels, not translations. The translation is buon senso (common sense). (BTW, it is: Dizionario Garzanti di inglese, Garzanti Editore, 1997, ISBN 88-11-10212-X) _ Marco
RE: Not snazzy (was: New Unicode Savvy Logo)
Rick McGowan wrote: 2. It is unlikely that the Unicode *logo* itself (i.e. the thing at http://www.unicode.org/webscripts/logo60s2.gif) will be incorporated directly in any image that people are allowed to put on their websites, because putting the Unicode logo on a product or whatever requires a license agreement. I.e. the submissions from E. Trager are out of scope because they contain the Unicode logo on the left side. As this comes from a Unicode official, I guess we should simply accept it... Nevertheless, I wonder whether displaying the Unicode *logo* per se has the same legal implications as displaying a *banner* which contains the Unicode logo. IMVHO, that seems like the difference between producing a T-shirt with the Unicode logo and wearing it. In the first case, I must demonstrate that I asked for and obtained permission from the trademark owner; in the second case, I don't have to demonstrate anything (apart, maybe, from the fact that I did not steal that piece of garment). _ Marco
RE: Not snazzy (was: New Unicode Savvy Logo)
Andrew C. West wrote: I agree with Philippe on this one. A sensible, and easily understandable, motto like The world speaks Unicode would be much better. The word savvy just sends a shiver of embarrassment down my spine. Not only is savvy not a word that is probably high in the vocabulary list of non-English speakers, but I don't think many native English speakers would ever use it by choice (maybe it's just me, but I really loathe the word). Yes, you are right. I had never heard the word savvy before this morning. My English-Italian dictionary has two savvy entries: an adjective (labeled fam. amer. = US English, informal) and a noun (labeled antiq. / fam. = archaic or informal). However, all the translations have to do with common sense, and none of them seems to explain the intuitive meaning of Unicode savvy, which I guess is supposed to be: Unicode enabled, Unicode supported, encoded in Unicode, etc. Another i18n problem is the lettering: the unusual ligature of the first three letters and the mix-up of upper- and lower-case forms can make the text completely unintelligible to people not familiar with handwritten forms of the Latin alphabet. I guess that many people would wonder in what strange alphabet Unicode is written. As for the V-shaped tick in the square, it is so deformed and stylized that it might be hard to recognize. Keep in mind that this symbol is quite English-specific; in many parts of the world, different signs are used to tick squares on paper forms (e.g., X, O, a filled square, etc.). The English-style tick is only seen in GUI interfaces like Windows, Mac, etc. I also share the concerns about colors: besides their ugliness (I would never have imagined that that curious yellow could be called pink), they fail to recall the red and white of the well-known Unicode logo. If I didn't know it before seeing them, I would never have associated those icons with the Unicode standard or the Unicode Consortium.

My humble suggestions would be: 1) Replace the semi-dialectal Unicode savvy with a clearer motto (such as encoded in Unicode, or the other phrases suggested by others); possibly, check that all the words used are in the high-frequency part of the English lexicon. 2) Use the regular squared Unicode logo which is seen in the top-left corner of the Unicode web site. That's already famous and immediately hints at Unicode. 3) Compose the motto (*including* the word Unicode) in a widespread and highly readable typeface, in black or in one of the colors of the Unicode logo. 4) Make the V tick sign as similar as possible to a square root symbol, because that is the glyph which has been popularized by GUI interfaces. Ciao. Marco
RE: Exciting new software release!
Doug Ewell wrote: Drop everything and check out a kewl new Windows program available at: http://users.adelphia.net/~dewell/mathtext.html 𝔬𝔱𝔣𝔩! _ Marco
RE: Several BOMs in the same file
Stefan Persson wrote: Let's say that I have two files, namely file1 and file2, in any Unicode encoding, both starting with a BOM, and I combine them into one by using cat file1 file2 > file3 in Unix or copy file1 + file2 file3 in MS-DOS; file3 will have the following contents: BOM, contents from file1, BOM, contents from file2. Is this in accordance with the Unicode standard, or do I have to remove the second BOM? IMHO, Unicode should not specify such a behavior. Deciding what a shell command is supposed to do is a decision of the operating system, not of text encoding standards. BTW, consider that both Unix cat and DOS copy are not limited to Unicode text files. Actually, they are not even limited to text files at all: you could use them to concatenate a bitmap with a font with an HTML document with a spreadsheet... whether the result makes sense or not is up to you and/or to the applications that will process the resulting file. Probably, there should be two separate commands (or different options of the same command): one to do a raw byte-by-byte concatenation, and one to do an encoding-aware concatenation of text files. E.g., imagine a cat command with these extensions: Synopsis: cat [ -... ] [ -R encoding ] { [ -F encoding ] file } Description: ... If neither -R nor -F is specified, the concatenation is done byte by byte. Options: ... -R specifies the encoding of the resulting *text* file; -F specifies the encoding of the following *text* file. Your command above would now expand to something like this: cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3 Provided with information about the input encodings and the expected output encoding, cat could now correctly handle BOMs, endianness, new-line conventions, and even perform character set conversions. Without this extra info, cat would retain its good ol' byte-by-byte functionality.
Similar options could be added to any Unix command potentially dealing with text files (cp, head, tail, etc.), as well as to their equivalents in DOS or other operating systems. _ Marco
RE: Several BOMs in the same file
I (Marco Cimarosti) wrote: As a minimum, option -v must know the semantics of the NL and LF control codes, of the digits, and of white space. Sorry, I meant: option -n. _ Marco
RE: Several BOMs in the same file
Kent Karlsson wrote: I'm not going into the implementation part; just pointing out that this issue is not something an operating system can ignore. cat and cp can and shall ignore it. They are octet-level file operations, attaching no semantics to the octets. Try iconv. This byte-level operation is just the default behavior. This basic behavior should remain the default, of course. However, there already are a lot of options specific to text files that *do* attach character semantics to octets, such as the -n option to number output lines: http://www.hmug.org/man/1/cat.html As a minimum, option -v must know the semantics of the NL and LF control codes, of the digits, and of white space. There is no technical reason for not adding more options to act more sensibly with the encoding(s) of the involved text file(s). Again, any such text-specific option must be disabled by default, in order to preserve the basic byte-by-byte operation. _ Marco
RE: List of ligatures for languages of the Indian subcontinent.
Kenneth Whistler wrote: Dream on. The information needed exists in books and other reference sources in libraries, book shops, and other collections across India -- and, for that matter, around the world. It is merely a matter of collecting the relevant information and distilling it into succinct, yet complete, statements of the relevant information needed for proper typographic practice for each script, for each style of each script, for each local typographic tradition for each style, and so on. A couple of hints for William and other people interested in this issue: - Akira Nakanishi, Writing Systems of the World -- Alphabets, Syllabaries, Pictograms, Tuttle 1980 (1999), ISBN 0804816549. This charming little book explores all the scripts used in the world today, giving for each of them a table of all the signs (apart from Chinese, of course) and an explanation of how the script works. For each script, it also reproduces a page from a daily newspaper written in that script. The information is not always 100% accurate; however, the book remains an invaluable introduction to the scripts of the world, and a perfect complement to the reading of the Unicode Standard. - The grammars in the National Integration Series by Balaji Publications, Madras, India. Each grammar in this series is a small A5-format book bearing a title like: Learn (language name) in 30 Days through English. The grammars are not very sound from the linguistic point of view (it's unlikely that the reader will actually learn an Indian language in one month!), but they all have a very interesting introduction to the script used by each language, which normally includes a table of all the combinations of consonant+vowel, a table of the essential consonant clusters, and the half or subjoined consonants.

If you compare the grammars of languages sharing the same script (such as Sanskrit, Hindi, and Marathi, all written with the Devanagari script), you can see how the list of required ligatures varies from one language to another. Notice that these books too are far from being 100% accurate. All the above books are inexpensive and easily found in bookshops in the UK and elsewhere. Another good source for making a list of required glyphs is the existing non-Unicode fonts for Indic languages. The nicest free collection I have seen so far is the Akruti GNU TrueType fonts, which contain a set of glyphs appropriate for most modern usages: http://www.akruti.com/freedom/ _ Marco
RE: Need encoding conversion routines
askq1 askq1 wrote: From: Pim Blokland [EMAIL PROTECTED] However, you have said this is not what you want! So what is it that you do want? I want C/C++ code that will give me the UTF-8 byte sequence representing a given code point, the UTF-16 16-bit sequence representing a given code point, and the UTF-32 32-bit sequence representing a given code point. e.g. UTF8_Sequence CodePointToUTF8(Unichar codePoint) { //I need this code } UTF16_Sequence CodePointToUTF16(Unichar codePoint) { //I need this code } UCS2_Sequence CodePointToUCS2(Unichar codePoint) { //I need this code } Hint: #include "ConvertUTF.h" typedef UTF32 Unichar; typedef UTF8 UTF8_Sequence[4 + 1]; typedef UTF16 UTF16_Sequence[2 + 1]; typedef UTF16 UCS2_Sequence[1 + 1]; _ Marco
RE: Need encoding conversion routines
askq1 askq1 wrote: I want C/C++ functions/routines that will convert Unicode to UTF-8/UTF-16/UCS-2 encodings and vice versa. Can someone point me to where I can get these code routines? Unicode's reference implementation is here, although I don't know how up to date it is with some recent small changes in UTF-8: http://www.unicode.org/Public/PROGRAMS/CVTUTF/ You can also look at the UTF functions provided by ICU, an open-source Unicode library: http://oss.software.ibm.com/icu/ _ Marco
RE: Encoding: Unicode Quarterly Newsletter
Kenneth Whistler wrote: [...] Of course, further weight corrections need to be applied if reading the standard *below* sea level or in a deep cave. I hope it will not be considered pedantic to observe that the mass or weight of a book does not change depending on whether someone is reading it or not. Consequently, the same weight corrections need to be applied also if someone *throws* the standard into a deep cave. _ Marco