Re: U+0BA3, U+0BA9
At 02:08 PM 10/25/03 -0700, Doug Ewell wrote: > So, in effect the UNICODE character names attempt to be > a unified transliteration scheme for all languages? Are these > principles laid down somewhere or is this more informal?

The Unicode character names attempt to be (a) unique and (b) reasonably mnemonic. Anything beyond that is a bonus. They expressly do *not* represent any form of transliteration or transcription scheme. However, it is sometimes forgotten that the standard is intended to be in English (with the possibility of translation to other languages, for example the French translation that has been carried out for 3.2). If a character has an obvious or common English name, that name should be used. Where there is no obvious English name, a transliteration or transcription of the native name makes sense.

In the case of a script used by multiple languages, it's an interesting question which language wins out. Assume you have a majority language that doesn't use a certain character, but has a word for it. Does it make more sense to keep all transcriptions in the same language in Unicode character names? Opinions differ. Ultimately the only strong requirements are that names are unique and (recently added) that dropping common words such as LETTER, MARK, SIGN and SYMBOL, as well as spaces and hyphens, does not affect that uniqueness.

Since the character names freeze mistakes permanently, and since committee decisions have resulted in some odd and not always consistent approaches to naming, some of the translated sets of character names are more consistent and usable than the official English. That has led to the suggestion of eventually creating a translation of the character names into e.g. American English, essentially providing a set of consistent aliases that might be useful for dictionaries of character names exposed to end users interested in locating characters, as opposed to merely wanting the formal, but potentially arbitrary, reference. A./
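The uniqueness-after-dropping-words rule mentioned above can be sketched in a few lines. This is an illustration only: the word list and folding below are simplifications, not the normative UAX #44 loose-matching algorithm.

```python
import re
import unicodedata

# Sketch of the rule: names must stay unique even after dropping the
# words LETTER, MARK, SIGN and SYMBOL plus spaces and hyphens.
# (Simplified for illustration; not the normative matching rules.)
DROPPED = {"LETTER", "MARK", "SIGN", "SYMBOL"}

def fold_name(name: str) -> str:
    kept = [w for w in name.split() if w not in DROPPED]
    return re.sub(r"[-\s]", "", " ".join(kept))

# The two characters from the subject line stay distinct after folding:
print(fold_name(unicodedata.name("\u0BA3")))  # TAMILNNA
print(fold_name(unicodedata.name("\u0BA9")))  # TAMILNNNA
```

Under this folding, for instance, ALEF SYMBOL collapses to plain ALEF, which is why HEBREW LETTER ALEF keeps its script prefix in the folded form.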
RE: CGJ - Combining Class Override
Sorry, Philippe, I had meant a separate character for a "right Meteg", not a separate control character. Does this mean we agree? Jony > -Original Message- > From: Philippe Verdy [mailto:[EMAIL PROTECTED] > Sent: Saturday, October 25, 2003 5:58 PM > To: Jony Rosenne > Cc: [EMAIL PROTECTED] > Subject: Re: CGJ - Combining Class Override > > > From: "Jony Rosenne" <[EMAIL PROTECTED]> > > > For the record, I repeat that I am not convinced that the CGJ is an > > appropriate solution for the problems associated with the > right Meteg. > > I tend to think we need a separate character. > > Yes, it's possible to devise another character explicitly to > override very precisely the ordering of combining classes. > But this still does not change the problem, as all the > existing NF* forms in existing documents using any past or > present version of Unicode MUST remain in NF* form with > further additions. > > If one votes for a separate control character, it should come > with precise rules describing how such an override can/must be > used, so that we won't break existing implementations. This > character will necessarily have a combining class 0, but will > still have a preceding context. Strict conformance for the > new NF* forms must still obey the precise ordering rules, > and this character, whatever its form, shall not be used > when it is not needed, i.e. when the existing > NF* forms still produce the correct logical order (that's why > its use should then be restricted to a list of known > combining characters that may need this override). > > Call it "Combining Class Override" ? This does not > change the problem: this character should be used only > between pairs of combining characters, such that the encoded sequence: > {c1, CCO, c2} > shall conform to the rules: > (1) CC(c1) > CC(c2) > 0, > (2) c1 is known (listed by Unicode?) to require this override > to keep the logical ordering needed for correct text semantics. 
> > The second requirement should be made to avoid abuses of this > character. But it is not enforceable if CGJ is kept for this function. > > The CCO character should then be made "ignorable" for > collation or text breaks, so that collation keys will become: > [ CK(c1), CK(c2) ] for {c1, CCO, c2} > [ CK(c2), CK(c1) ] for {c2, c1} and {c1, c2} if normalized > > Legacy applications will detect a separate combining sequence > starting at CCO, but newer applications will still know that > both sequences are describing a single grapheme cluster. > > This knowledge should not be necessary except in grapheme > renderers, or in some input methods that will allow users to > enter: > (1) keys producing the normalized text {c2, c1} > as before; > (2) keys producing the normalized text {c1, CCO, c2} > instead of {c2, c1} as before; > (3) optionally support a keystroke or selection system to swap > combining characters. > > If this is too complex, the only way to manage the situation > is to duplicate existing combining characters that cause this > problem, and I think this may get even worse, as this > duplication may need to be combinatorial and require a lot of > new codepoint assignments. > >
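The reordering problem discussed in this thread can be demonstrated with real characters. A sketch in Python's `unicodedata`; the proposed CCO character does not exist, so CGJ (U+034F, combining class 0) stands in as the class-0 blocker: meteg (U+05BD, class 22) typed before patah (U+05B7, class 17) is swapped by canonical ordering, unless a class-0 character intervenes.

```python
import unicodedata

BET, METEG, PATAH, CGJ = "\u05D1", "\u05BD", "\u05B7", "\u034F"
print(unicodedata.combining(PATAH))  # 17
print(unicodedata.combining(METEG))  # 22
print(unicodedata.combining(CGJ))    # 0

# Canonical ordering sorts marks by combining class, so a meteg typed
# before a patah is swapped to patah-then-meteg by normalization...
assert unicodedata.normalize("NFC", BET + METEG + PATAH) == BET + PATAH + METEG

# ...but a class-0 character in between blocks the reordering, so the
# typed order survives normalization unchanged.
blocked = BET + METEG + CGJ + PATAH
assert unicodedata.normalize("NFC", blocked) == blocked
```

This is the mechanism by which CGJ (or a hypothetical CCO) preserves a "right meteg" ordering through NF* forms.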
Re: Traditional dollar sign
From: "Simon Butcher" <[EMAIL PROTECTED]> > Hi! > > Just a quick question.. The description for U+0024 (DOLLAR SIGN) states that the glyph may contain one or two vertical bars. Is there a codepoint specifically for the traditional double-bar form, or any plan to include one in the future? > > I was taught at school that the double-bar form was used when Australia switched to decimal currency in 1966, and that it was incorrect to write the single-bar form when referring to Australian dollars. I guess the single-bar form had taken over due to the lack of support from type-faces and computing devices, although it's still quite common to see it in Australian publications, especially in large fonts (headlines, advertising, etc). There's a similar consideration in French primary schools about the correct way to draw the decimal digits: the handwritten barred form of digit seven is mandatory to avoid confusion with the handwritten digit one, and the "uppercase L with stroke" and "zigzag" forms of digit four are also prohibited. In school books, they are shown correctly, but this rule is rapidly forgotten once children are used to drawing digits that are easy to differentiate.
Re: Traditional dollar sign
From: "Peter Kirk" <[EMAIL PROTECTED]> > I wonder how long before the Euro will also de facto have a single bar? This has already been the case since the birth of the symbol, when some legal texts specified that (if nothing else) an uppercase letter E can be used in environments that don't support the exact initial euro symbol design. And in fact I can now see a lot more variants of the symbol in ads and other commercial displays, using one of the many forms that have appeared for it. And I myself sometimes handwrite it with a single bar, which sometimes looks just like a tall and wide lowercase e in which the single bar touches the top right corner of a slanted curve, simply because I usually draw the horizontal stroke before this curve, forgetting to draw the second bar or drawing it too often on top of the first bar. If there are effectively semantic differences between a single-bar and double-bar glyph for the dollar in Australia, New Zealand or other countries using this symbol, and the glyph for the US dollar, a variant may be the best solution to represent them (letting users select a font that makes this distinction). I bet it will be exceptional.
Re: Merging combining classes, was: New contribution N2676
From: "Peter Kirk" <[EMAIL PROTECTED]> > I can see that there might be some problems in the changeover phase. But > these are basically the same problems as are present anyway, and at > least putting them into a changeover phase means that they go away > gradually instead of being standardised for ever, or however long > Unicode is planned to survive for. I had already thought about it. But this may cause more trouble in the future for handling languages (like modern Hebrew) in which those combining classes are not a problem, and where the ordering of combining characters is a real bonus that would be lost if combining classes are merged, notably for full text searches, where the number of order combinations to search could explode, as the effective order in occurrences could become unpredictable for searches. Of course, if the combining class values were really bogus, a much simpler way would be to deprecate some existing characters, allowing new applications to use the new replacement characters, and slowly adapt the existing documents with the replacement characters whose combining classes would be more language-friendly. This last solution may seem better, but only in the case where a unique combining class can be assigned to these characters. As someone said previously on this list, there are languages in which such an axiom will cause problems, meaning that, with the current model, those problematic combining characters would have to be encoded with a null combining class, and linked to the previous combining sequence using either a character property (for its combining behavior in grapheme clusters and for rendering) or a specific joiner control (ZWJ ?) if this property is not universal for the character. > It isn't a problem for XML etc as in such cases normalisation is > recommended but not required, thankfully. 
In practice, "recommended" will mean that many processes will perform this normalization as part of their internal job, so it would cause interoperability problems if the result of this normalization is later retrieved by the unaware client that submitted the data to that service, which is supposed to keep the normalization identity of the string. Also I have doubts about the validity of this change in the face of the stability pact signed between Unicode and the W3C for XML.

> As for requirements that lists > are normalised and sorted, I would consider that a process that makes > assumptions, without checking, about data received from another process > under separate control is a process badly implemented and asking for > trouble.

Here the problem is that we will not always have to manage the case of separate processes, but also the case of utility libraries: if a library is upgraded separately, the application using it may start experiencing problems. E.g. I am thinking about the implied sort order in SQL databases for table indices: what would happen if the SQL server is stopped just long enough to upgrade a standard library implementing the normalization among many other services, because another security bug such as a buffer overrun is fixed in another API? When restarting the SQL server with the new library implementing the new normalization, nothing would happen, apparently, but the sort order would no longer be guaranteed, and stored sorted indices would start being "corrupted", in a way that would invalidate binary searches (meaning that some unique keys could become duplicated, or not found, producing unpredictable results, critical if they are relied upon for, say, user authentication, or file existence). Of course such an upgrade should be documented, but this would occur at very intimate levels of a utility library incidentally used by the server. 
Will all administrators and programmers be able to find and know all the intimate details of this change, when Unicode has stated to them that normalized forms should never change? Will it be possible to scan and rebuild the corrupted data with a check&repair tool, if the programmers of this system assumed that the Unicode statement was definitive and allowed performing such assumptions to build optimized systems?

When I read the stability pact, I can conclude from it that any text valid and normalized in one version of Unicode will remain normalized in any version of Unicode (including previous ones) provided that the normalized strings contain characters that were all defined in the previous version. This means that there's an upward _and_ backward compatibility of encoded strings and normalizations on their common defined subset (excluding only characters that have been defined in later versions but were not assigned in previous versions). The only thing that is allowed to change is the absolute value of non-zero combining classes (but in a limited way, as for now they are limited to an 8-bit value range also specified in the current stability pact with the XML working group), but not their relative order; merging neighbouring classes would change exactly that relative order.
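The SQL-index scenario described above can be reproduced in miniature. The two key functions below are hypothetical stand-ins for "old library" and "new library" orderings: an index built sorted under one ordering silently breaks binary search once lookups use another.

```python
import bisect

# Two hypothetical "library versions" that order the same strings
# differently (standing in for a changed normalization/collation).
key_v1 = lambda s: s.lower()   # old library: case-insensitive order
key_v2 = lambda s: s           # new library: raw code-point order

names = ["alice", "Bob", "carol"]
index = sorted(names, key=key_v1)   # index built under the old order

def contains(sorted_list, item, key):
    # Plain binary search, as an SQL index lookup would do.
    keys = [key(x) for x in sorted_list]
    i = bisect.bisect_left(keys, key(item))
    return i < len(keys) and keys[i] == key(item)

print(contains(index, "Bob", key_v1))  # True: key matches index order
print(contains(index, "Bob", key_v2))  # False: index looks "corrupted"
```

No data changed on disk; only the comparison function did, yet an existing key silently becomes unfindable, which is exactly the authentication/uniqueness hazard described above.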
RE: Traditional dollar sign
At 11:02 AM 10/26/03 +1100, Simon Butcher wrote: Hi! > >I was taught at school that the double-bar form was used > when Australia > >switched to decimal currency in 1966, and that it was > incorrect to write > >the single-bar form when referring to Australian dollars. > > It would be interesting if you could document that. That could be tough :) Literature shown to me was at school (many years ago), and digging it up would be difficult. It's widely known that the double-bar form does exist, though, at least! But we knew that. > >I guess the single-bar form had taken over due to the lack > of support from > >type-faces and computing devices, although it's still quite > common to see > >it in Australian publications, especially in large fonts (headlines, > >advertising, etc). > > It looks like actual practice is what you describe: the free > alternation > between the forms without change in meaning. > > If we were to add a code point we would get into the > situation that the > free alternation would suddenly become a matter of content > difference (not > just a choice in presentation). In other cases where the > majority of users > freely alternate, but there is indication that some subset of > users need to > maintain a form distinction we have used standardized > variants. This has > been done mostly for mathematical symbols. I understand, although couldn't that same argument be used against many of the characters in the 'Dingbats' section, such as the ornamental variations of exclamation marks, quotation marks, and so forth? I do realise these come from an existing character set, but there are indeed still users of the double-bar form. Even my Concise Oxford Dictionary is printed using the double-bar form (under the term, 'dollar'). If their font uses that other shape, that's what they get. Only when the distinction is required (as demonstrated in actual use, not just what you get taught in school) should we disunify. 
I just thought it extremely odd that a character which is still in common (albeit admittedly waning) use is not included in the set. Peter Kirk made a valid observation with regards to the Lira symbol (U+20A4) which Unicode admits often has U+00A3 (Pound sign) used in its place, with the only difference being a double-bar on U+20A4. I've never seen a widely used font with both symbols in it. That alone suggests that the unification is correct. For the case of the Lira, I plead ignorance on the specific justification (and whether I would have considered it important). The fact is that the source for it is buried in the early drafts of Unicode, probably predating my involvement - so the only thing I can note is that TUS 4.0 points out that 00A3 should be used (i.e. suggests a de facto unification in recommended use). A./
RE: Traditional dollar sign
Hi! > >I was taught at school that the double-bar form was used > when Australia > >switched to decimal currency in 1966, and that it was > incorrect to write > >the single-bar form when referring to Australian dollars. > > It would be interesting if you could document that. That could be tough :) Literature shown to me was at school (many years ago), and digging it up would be difficult. It's widely known that the double-bar form does exist, though, at least! > >I guess the single-bar form had taken over due to the lack > of support from > >type-faces and computing devices, although it's still quite > common to see > >it in Australian publications, especially in large fonts (headlines, > >advertising, etc). > > It looks like actual practice is what you describe: the free > alternation > between the forms without change in meaning. > > If we were to add a code point we would get into the > situation that the > free alternation would suddenly become a matter of content > difference (not > just a choice in presentation). In other cases where the > majority of users > freely alternate, but there is indication that some subset of > users need to > maintain a form distinction we have used standardized > variants. This has > been done mostly for mathematical symbols. I understand, although couldn't that same argument be used against many of the characters in the 'Dingbats' section, such as the ornamental variations of exclamation marks, quotation marks, and so forth? I do realise these come from an existing character set, but there are indeed still users of the double-bar form. Even my Concise Oxford Dictionary is printed using the double-bar form (under the term, 'dollar'). I just thought it extremely odd that a character which is still in common (albeit admittedly waning) use is not included in the set. 
Peter Kirk made a valid observation with regards to the Lira symbol (U+20A4) which Unicode admits often has U+00A3 (Pound sign) used in its place, with the only difference being a double-bar on U+20A4. Cheers, - Simon
Re: Unicode and Script Encoding Initiative in San Jose Mercury News
Doug Ewell wrote: [...] about "You see, boys and girls, computers think only in numbers" -- in a Silicon Valley paper, [...] Should we tell them about “real” quotes? “Real quotes” are not just for Web publication; they are also for email. Throw in real dashes, of the kind – en or em – you prefer. Eric 8-)
Re: New contribution N2676
>Should we continue to encode this as ARTABE SIGN and just note the use of > this shape for 'zero' in an annotation? > Should we change it to another name and add the annotation for 'artabe'?> > Should we take any other actions? Well I don't quite know. My real interest is in the changing shape of the zero, but I am not yet ready with a proposal. Besides, in the papyri where Kenyon read Artabe this symbol is much of the time coupled with another, the two written rather cursively together in the papyri. Kenyon carefully records all the different forms, and after seeing that I am in some doubt about what exactly should be encoded. I suspect that the new list is based not on the many many symbols given by Kenyon in his many volumes of transcribed papyri, but on a summary list that he published before that. I wish I could be more definite. Raymond - Original Message - From: "Asmus Freytag" <[EMAIL PROTECTED]> To: "Raymond Mercier" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Saturday, October 25, 2003 8:26 PM Subject: Re: New contribution N2676 > > At 05:51 PM 10/25/03 +0100, Raymond Mercier wrote: > > Among the new characters in N2676 there is > > > > 10186 G GREEK ARTABE SIGN > > > > This is one of the many signs found in papyri, such as those edited by > >Kenyon. This symbol apparently represents a measure of volume used for > >grain. It appears as a small circle, smaller than omicron, with a long > >overline, much longer than a macron. > > > > While I have been looking for the various forms of the symbol for zero I > >find in other papyri quite exactly the same character used for 'zero'. I make > >this comparison after studying many photographs of papyri, those provided > >with Kenyon's editions on the one hand, and on the other, Alexander Jones' > >recent volume of horoscopes, Astronomical Papyri from Oxyrhynchus. > > The attached image is taken from Jones, part of a column of zeroes written > >this way. > > This is fascinating information. 
> > However, I'm unclear what you propose. > > Should we continue to encode this as ARTABE SIGN and just note the use of > this shape for 'zero' in an annotation? > > Should we change it to another name and add the annotation for 'artabe'? > > Should we take any other actions? > > A./
Re: U+0BA3, U+0BA9
On 25/10/2003 14:08, Doug Ewell wrote: Peter Jacobi wrote: So, in effect the UNICODE character names attempt to be a unified transliteration scheme for all languages? Are these principles laid down somewhere or is this more informal? The Unicode character names attempt to be (a) unique and (b) reasonably mnemonic. Anything beyond that is a bonus. They expressly do *not* represent any form of transliteration or transcription scheme. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/ If you think the Tamil is misleading, look at the Cyrillic. The same sound is written as I in 0415, Y in 042E and J in 0408. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Unicode and Script Encoding Initiative in San Jose Mercury News
Deborah W. Anderson wrote: > The Business section in today's San Jose Mercury News (Friday, Oct. > 24) has a story on Unicode and the Script Encoding Initiative: > http://www.bayarea.com/mld/mercurynews/business/7092371.htm Nice article. Good to see some mainstream publicity for this worthy effort. My eyes rolled waaay up when I got to the part about "You see, boys and girls, computers think only in numbers" -- in a Silicon Valley paper, yet! But I guess this did appear in the Business section, not the Technology section. On the typographical dark side, it was quite discouraging to see ``this horrible quoting convention'' in a Web publication of an article about Unicode. Should we tell them about “real” quotes? -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: transliteration in java
Check out ICU4J (http://oss.software.ibm.com/icu4j/). There is a demo of transliteration at http://oss.software.ibm.com/cgi-bin/icu/tr. For Cyrillic, we currently only do an ISO-based transliteration, but you can do your own custom ones. (The demo will store custom rules that people have devised. I see that there are a couple of Cyrillic ones, as well as a number of ones we don't have in the stock ICU, such as American/Canadian Indian transliterators.) Mark, http://www.macchiato.com ► शिष्यादिच्छेत्पराजयम् ◄ - Original Message - From: Dennis N. Stetsenko To: [EMAIL PROTECTED] Sent: Sat, 2003 Oct 25 11:25 Subject: transliteration in java Hello, My apologies if such a kind of question is too silly, but I browsed quickly through resources\FAQ and did not find anything useful for me… I have a bunch of files that are in a Cyrillic charset and I need to transfer them to some device that is not capable of showing such a charset (it doesn't have an appropriate font). So, I've decided to provide a transliteration mechanism, i.e. convert chars from Cyrillic to Latin. The language that I'm going to use is Java. Can you guys point me to some useful resource to do so or give me some recommendation? = I've made some preliminary prototyping, and the results appear to be weird. 1 I provide a mapping from a char (let's say Cyrillic) to its Latin equivalent in the sense of transliteration 2 Take the flat file and process it (convert from Cyrillic to Latin) Sometimes it's working, sometimes it's not… Apparently when I run simple things from my IDE it works fine, but when I'm trying to do the same in standalone mode – it skips processing. 
I was hunting down the problem and this is the difference I see: When I do a call like Character.UnicodeBlock.of(toProcess) for the next char to transliterate, it shows From IDE - CYRILLIC Standalone - LATIN_1_SUPPLEMENT So, I guess the way the flat file is read makes a big difference… I'm willing to blame some difference in system properties settings for such calls… Can you help me with pointers to make it the way it should be? Thanks, Dennis
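The IDE-vs-standalone difference described above is the classic symptom of the file being read with the platform default charset. In Java the cure is to pass the charset explicitly, e.g. `new InputStreamReader(new FileInputStream(f), "windows-1251")` instead of `FileReader`, which always uses the default. The mis-decoding itself can be sketched in a few lines (Python here for brevity; the sample text and the windows-1251 encoding are illustrative assumptions about Dennis's files):

```python
import unicodedata

# Cyrillic text stored as windows-1251 bytes on disk...
data = "Привет".encode("windows-1251")

# ...but decoded with a Latin-1-like default charset: every byte
# >= 0x80 becomes a Latin-1 Supplement character, which is exactly
# why Character.UnicodeBlock.of() reports LATIN_1_SUPPLEMENT.
garbled = data.decode("latin-1")
print(unicodedata.name(garbled[0]))   # a LATIN ... letter, not CYRILLIC

# Decoded with the right charset, the Cyrillic block is preserved
# and a transliteration table keyed on Cyrillic chars will match.
correct = data.decode("windows-1251")
print(unicodedata.name(correct[0]))   # CYRILLIC CAPITAL LETTER PE
```

So the transliteration mapping was never at fault: standalone runs simply never produced Cyrillic characters to look up.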
Re: U+0BA3, U+0BA9
Peter Jacobi wrote: > So, in effect the UNICODE character names attempt to be > a unified transliteration scheme for all languages? Are these > principles laid down somewhere or is this more informal? The Unicode character names attempt to be (a) unique and (b) reasonably mnemonic. Anything beyond that is a bonus. They expressly do *not* represent any form of transliteration or transcription scheme. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: New contribution N2676
At 05:51 PM 10/25/03 +0100, Raymond Mercier wrote: Among the new characters in N2676 there is 10186 G GREEK ARTABE SIGN This is one of the many signs found in papyri, such as those edited by Kenyon. This symbol apparently represents a measure of volume used for grain. It appears as a small circle, smaller than omicron, with a long overline, much longer than a macron. While I have been looking for the various forms of the symbol for zero I find in other papyri quite exactly the same character used for 'zero'. I make this comparison after studying many photographs of papyri, those provided with Kenyon's editions on the one hand, and on the other, Alexander Jones' recent volume of horoscopes, Astronomical Papyri from Oxyrhynchus. The attached image is taken from Jones, part of a column of zeroes written this way. This is fascinating information. However, I'm unclear what you propose. Should we continue to encode this as ARTABE SIGN and just note the use of this shape for 'zero' in an annotation? Should we change it to another name and add the annotation for 'artabe'? Should we take any other actions? A./
transliteration in java
Hello, My apologies if such a kind of question is too silly, but I browsed quickly through resources\FAQ and did not find anything useful for me… I have a bunch of files that are in a Cyrillic charset and I need to transfer them to some device that is not capable of showing such a charset (it doesn't have an appropriate font). So, I've decided to provide a transliteration mechanism, i.e. convert chars from Cyrillic to Latin. The language that I'm going to use is Java. Can you guys point me to some useful resource to do so or give me some recommendation? = I've made some preliminary prototyping, and the results appear to be weird. 1 I provide a mapping from a char (let's say Cyrillic) to its Latin equivalent in the sense of transliteration 2 Take the flat file and process it (convert from Cyrillic to Latin) Sometimes it's working, sometimes it's not… Apparently when I run simple things from my IDE it works fine, but when I'm trying to do the same in standalone mode – it skips processing. I was hunting down the problem and this is the difference I see: When I do a call like Character.UnicodeBlock.of(toProcess) for the next char to transliterate, it shows From IDE - CYRILLIC Standalone - LATIN_1_SUPPLEMENT So, I guess the way the flat file is read makes a big difference… I'm willing to blame some difference in system properties settings for such calls… Can you help me with pointers to make it the way it should be? Thanks, Dennis
Re: New contribution N2676
At 02:29 +0200 2003-10-25, Philippe Verdy wrote: 0659 ARABIC ZWARAKAY . Pashto Why not ARABIC MACRON ? Well, Zwarakay may be appropriate if this is the transliterated Arabic name. It isn't a macron. It's a zwarakay, and that's a Pashto name. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Traditional dollar sign
On 25/10/2003 10:16, Asmus Freytag wrote: At 03:36 AM 10/26/03 +1100, Simon Butcher wrote: Just a quick question.. The description for U+0024 (DOLLAR SIGN) states that the glyph may contain one or two vertical bars. Is there a codepoint specifically for the traditional double-bar form, or any plan to include one in the future? No. I was taught at school that the double-bar form was used when Australia switched to decimal currency in 1966, and that it was incorrect to write the single-bar form when referring to Australian dollars. It would be interesting if you could document that. I guess the single-bar form had taken over due to the lack of support from type-faces and computing devices, although it's still quite common to see it in Australian publications, especially in large fonts (headlines, advertising, etc). It looks like actual practice is what you describe: the free alternation between the forms without change in meaning. If we were to add a code point we would get into the situation that the free alternation would suddenly become a matter of content difference (not just a choice in presentation). In other cases where the majority of users freely alternate, but there is indication that some subset of users need to maintain a form distinction, we have used standardized variants. This has been done mostly for mathematical symbols. In theory, this could be done here as well, but any thoughts in that direction would need to be preceded by clear and compelling evidence of an actual requirement. The case of an official preference that has never been widely adhered to -- which is what you have described -- would probably not qualify as grounds for taking any action. A./ The situation seems very similar to that for U+20A4 vs. U+00A3. I was taught at school in the UK, and I guess Australians were taught before 1966, to write the pound sign with two bars like U+20A4, and in fact I still usually do so in handwriting. 
But today the single-barred version is much more common in print in the UK. And the notes for U+20A4 suggest that this became true also in Italy, before the Euro was introduced. I wonder how long before the Euro will also de facto have a single bar? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Merging combining classes, was: New contribution N2676
Philippe Verdy wrote: The problem with this solution is that stability is not guaranteed across backward versions of Unicode: if a tool A implements the new version of combining classes and normalizes its input, it will keep the relative ordering of characters. If its output is injected into a tool B that still uses the legacy classes, the tool B may either reject the input (not normalized) or force the normalization. Then if the text comes back to tool A, it will see a modified text. Wouldn't it be possible to, if this is of any importance in a specific situation, specify a Unicode version, and not utilise additional normalisation data that is only specified in later versions than the specified version? For example, x = normalise("some text", 4.0); normalises the text according to the rules specified in Unicode 4.0, or, if the software has not yet been updated with this information, according to the rules in an earlier version of Unicode, while x = normalise("some text"); would normalise the text according to the most recent version of Unicode for which the "normalise" program has any data. Stefan
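Stefan's suggested API can only be partially sketched with today's libraries, since Python's `unicodedata` (like most implementations) ships data for exactly one Unicode version. The wrapper below uses hypothetical names and merely verifies that the caller's pinned version is satisfiable before normalizing; a real implementation would need per-version normalization tables.

```python
import unicodedata

def normalise(text, version=None):
    """Hypothetical version-pinned normalization per the suggestion
    above. We can only check that data for the requested version is
    available, since the module carries a single version's tables."""
    if version is not None:
        available = tuple(int(x) for x in unicodedata.unidata_version.split("."))
        wanted = tuple(int(x) for x in str(version).split("."))
        if wanted > available:
            raise ValueError(
                "Unicode %s data not available (have %s)"
                % (version, unicodedata.unidata_version))
    return unicodedata.normalize("NFC", text)

print(normalise("e\u0301"))           # 'é' under the library's version
print(normalise("e\u0301", "4.0.0"))  # ok: 4.0 data is a subset
```

The design point this exposes is exactly Philippe's: pinning a version only works if every tool in the pipeline can honor the same pin.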
Re: New contribution N2676
Among the new characters in N2676 there is 10186 G GREEK ARTABE SIGN This is one of the many signs found in papyri, such as those edited by Kenyon. This symbol apparently represents a measure of volume used for grain. It appears as a small circle, smaller than omicron, with a long overline, much longer than a macron. While I have been looking for the various forms of the symbol for zero I find in other papyri quite exactly the same character used for 'zero'. I make this comparison after studying many photographs of papyri, those provided with Kenyon's editions on the one hand, and on the other, Alexander Jones' recent volume of horoscopes, Astronomical Papyri from Oxyrhynchus. The attached image is taken from Jones, part of a column of zeroes written this way. Raymond Mercier > - Original Message - > From: "Michael Everson" <[EMAIL PROTECTED]> > To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> > Sent: Friday, October 24, 2003 7:36 PM > Subject: New contribution N2676 > > > > > > A new contribution: > > N2676 > > Repertoire additions from meeting 44 > > Asmus Freytag > > 2003-10-23 > > http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2676.pdf > > > > -- > > Michael Everson * * Everson Typography * * http://www.evertype.com > <>
Re: Traditional dollar sign
At 03:36 AM 10/26/03 +1100, Simon Butcher wrote: Just a quick question.. The description for U+0024 (DOLLAR SIGN) states that the glyph may contain one or two vertical bars. Is there a codepoint specifically for the traditional double-bar form, or any plan to include one in the future? No. I was taught at school that the double-bar form was used when Australia switched to decimal currency in 1966, and that it was incorrect to write the single-bar form when referring to Australian dollars. It would be interesting if you could document that. I guess the single-bar form had taken over due to the lack of support from typefaces and computing devices, although it's still quite common to see it in Australian publications, especially in large fonts (headlines, advertising, etc). It looks like actual practice is what you describe: free alternation between the forms without change in meaning. If we were to add a code point, we would get into the situation that the free alternation would suddenly become a matter of content difference (not just a choice in presentation). In other cases where the majority of users freely alternate, but there is an indication that some subset of users needs to maintain a form distinction, we have used standardized variants. This has been done mostly for mathematical symbols. In theory, this could be done here as well, but any thoughts in that direction would need to be preceded by clear and compelling evidence of an actual requirement. The case of an official preference that has never been widely adhered to -- which is what you have described -- would probably not qualify as grounds for taking any action. A./
Re: Merging combining classes, was: New contribution N2676
On 25/10/2003 09:11, Philippe Verdy wrote: From: "Peter Kirk" <[EMAIL PROTECTED]> ... The problem would then be the interoperability of Unicode-compliant systems using distinct versions of Unicode (for example between XML processors, text editors, input methods, renderers, text converters, full text search engines). This may even be critical in tools like sorting, in applications that require and expect that their input is sorted according to its locale in a predictable way (for example in applications using binary searches in sorted lists of text items, such as authentication in a list of user names, or a filenames index). I can see that there might be some problems in the changeover phase. But these are basically the same problems as are present anyway, and at least putting them into a changeover phase means that they go away gradually instead of being standardised for ever, or however long Unicode is planned to survive. It isn't a problem for XML etc., as in such cases normalisation is recommended but not required, thankfully. As for requirements that lists are normalised and sorted, I would consider a process that makes assumptions, without checking, about data received from another process under separate control to be a process badly implemented and asking for trouble. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
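Peter's closing point -- verify rather than assume -- is cheap to implement today. A sketch in Python (the `is_normalized` check needs Python 3.8+; the binary-searched user list and the `safe_lookup` helper are illustrative names, not anything from the thread):

```python
import bisect
import unicodedata

def safe_lookup(sorted_names, name):
    """Binary-search a sorted list of NFC-normalized names, first checking
    (not assuming) that the query itself is normalized -- i.e. a process
    that validates data received from a process it does not control."""
    if not unicodedata.is_normalized("NFC", name):   # Python 3.8+
        name = unicodedata.normalize("NFC", name)
    i = bisect.bisect_left(sorted_names, name)
    return i < len(sorted_names) and sorted_names[i] == name

# The list itself is normalized once, at build time.
users = sorted(unicodedata.normalize("NFC", u) for u in ["ren\u00e9", "zo\u00eb"])

# A decomposed query (e + combining acute) is still found,
# where a naive binary search on raw code points would miss it.
assert safe_lookup(users, "rene\u0301")
assert not safe_lookup(users, "alice")
```

The cost of the check is a single scan when the input is already normalized, which is the common case; only unnormalized input pays for a full renormalization.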
Traditional dollar sign
Hi! Just a quick question.. The description for U+0024 (DOLLAR SIGN) states that the glyph may contain one or two vertical bars. Is there a codepoint specifically for the traditional double-bar form, or any plan to include one in the future? I was taught at school that the double-bar form was used when Australia switched to decimal currency in 1966, and that it was incorrect to write the single-bar form when referring to Australian dollars. I guess the single-bar form had taken over due to the lack of support from typefaces and computing devices, although it's still quite common to see it in Australian publications, especially in large fonts (headlines, advertising, etc). Cheers! - Simon
Re: Merging combining classes, was: New contribution N2676
From: "Peter Kirk" <[EMAIL PROTECTED]> > I wonder if it would in fact be possible to merge certain adjacent > combining classes, as from a future numbered version N of the standard. > That would not affect the normalisation of existing text; text > normalised before version N would remain normalised in version N and > later, although not vice versa. I know that this would break the letter > of the current stability policy, but is this kind of backward > compatibility actually necessary? The change could be sold to others as > required for the internal consistency of Unicode. The problem with this solution is that stability is not guaranteed across backward versions of Unicode: if a tool A implements the new version of combining classes and normalizes its input, it will keep the relative ordering of characters. If its output is injected into a tool B that still uses the legacy classes, the tool B may either reject the input (not normalized) or force the normalization. Then is the text comes back to tool A, it will see a modified text. One could argue that a CCO control may be generated when converting for backwards versions of Unicode. But will tool A know the version of Unicode used by legacy tool B, if B is a remote service that does not provide this version information to A? The problem would then be the interoperability of Unicode-compliant systems using distinct versions of Unicode (for example between XML processors, text editors, input methods, renderers, text converters, full text search engines. This may even be critical in tools like sorting, in applications that require and expect that their input is sorted according to its locale in a predictable way (for example in applications using binary searches in sorted lists of text items, such as authentication in a list of user names, or a filenames index).
Re: CGJ - Combining Class Override
From: "Jony Rosenne" <[EMAIL PROTECTED]> > For the record, I repeat that I am not convinced that the CGJ is an > appropriate solution for the problems associated with the right Meteg. I > tend to think we need a separate character. Yes, it's possible to devize another character explicitly to override very precisely the ordering of combining classes. But this still does not change the problem, as all the existing NF* forms in existing documents using any past or present version of Unicode MUST remain in NF* form with further additions. If one votes for a separate control character, it should come with precise rules describing how such override can/must be used, so that we won't break existing implementations. This character will necessary have a combining class 0, but will still have a preceding context. Strict conformance for the new NF* forms must still obey to the precise ordering rules, and this character, whatever its form, shall not be used everytime it is not needed, i.e. when the existing NF* forms still produce the correct logical order (that's why its use should then be restricted to a list of known combining characters that may need this override). Call it "Combining Class Override" ? This does not change the problem: this character should be used only between pairs of combining characters, such as the encoded sequence: {c1, CCO, c2} shall conform to the rules: (1) CC(c1) > CC(c2) > 0, (2) c1 is known (listed by Unicode?) to require this override to keep the logical ordering needed for correct text semantics. The second requirement should be made to avoid abuses of this character. But it is not enforceable if CGJ is kept for this function. 
The CCO character should then be made "ignorable" for collation or text breaks, so that collation keys will become: [ CK(c1), CK(c2) ] for {c1, CCO, c2} [ CK(c2), CK(c1) ] for {c2, c1} and {c1, c2} if normalized Legacy applications will detect a separate combining sequence starting at CCO, but newer applications will still know that both sequences are describing a single grapheme cluster. This knowledge should not be necessary except in grapheme renderers, or in some input methods that will allow users to enter: (1) keys producing the normalized text {c2, c1} as before; (2) keys producing the normalized text {c1, CCO, c2} instead of {c2, c1} as before; (3) optionally support a keystroke or selection system to swap combining characters. If this is too complex, the only way to manage the situation is to duplicate existing combining characters that cause this problem, and I think this may go even worse as this duplication may need to be combinatorial and require a lot of new codepoint assignments.
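The blocking behaviour Philippe wants from a CCO character is exactly what CGJ (U+034F, combining class 0) already provides: canonical reordering sorts adjacent marks by combining class, but stops at any class-0 character. A small Python check, using Hebrew meteg (U+05BD, class 22) and qamats (U+05B8, class 18) as the conflicting pair:

```python
import unicodedata

ALEF, QAMATS, METEG, CGJ = "\u05d0", "\u05b8", "\u05bd", "\u034f"

# The combining classes that cause the conflict:
assert unicodedata.combining(QAMATS) == 18
assert unicodedata.combining(METEG) == 22
assert unicodedata.combining(CGJ) == 0

# Without CGJ, canonical reordering swaps <meteg, qamats> to <qamats, meteg>:
assert unicodedata.normalize("NFD", ALEF + METEG + QAMATS) == ALEF + QAMATS + METEG

# CGJ's class 0 ends the reordering run, so the logical order survives NFD:
withcgj = ALEF + METEG + CGJ + QAMATS
assert unicodedata.normalize("NFD", withcgj) == withcgj
```

A dedicated CCO would have the same class-0 blocking effect; the difference Philippe argues for is purely in the usage rules attached to it.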
CGJ
For the record, I repeat that I am not convinced that the CGJ is an appropriate solution for the problems associated with the right Meteg. I tend to think we need a separate character. Jony > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Philippe Verdy > Sent: Saturday, October 25, 2003 1:12 PM > To: Peter Kirk > Cc: [EMAIL PROTECTED] > Subject: Re: New contribution N2676 > > > From: "Peter Kirk" <[EMAIL PROTECTED]> > > Have combining classes actually been defined for these characters? > > > > This is of course exactly the same problem as with Hebrew > vowel points > > and accents, except that this time it applies to real living > > languages. Perhaps it is time to do something about these combining > > classes which conflict with the standard. > > Do you mean officially documenting the correct (and strict) > use of CGJ as the only way to bypass the default order > required by the combining classes in normalized forms? It > would be a good idea to document officially which use of CGJ > is superfluous and should be avoided in NF forms, and which > use is required. > > 1) This will affect only the input methods for those > languages that need to "swap" the standard order of combining > characters to keep their logical order (all this will require > is an additional input control that will try swapping > ambiguous orders). > > 2) A complete documentation may need to specify which pairs > of combining characters are affected (this should list the > pairs of combining characters where CC(c1) > CC(c2) > and that require <c1, CGJ, c2> to be encoded to be kept in > logical order, as the sequence <c1, c2> will be reordered > into <c2, c1> in normalized forms. > > 3) The other issue would be that there may exist other > combining characters than those in this pair. Suppose I want > to represent <c1, c3, c2>, where CC(c1) > CC(c2), but > c3 does not have a conflicting pair in the previous list. > Should it be encoded as <c1, CGJ, c3, c2> or as <c1, c3, CGJ, c2>?
As the standard normalization algorithm > cannot be changed, both sequences will be possible with the > NF forms, even though they represent the same character. > > One could design an extra normalization step to force one > interpretation (so that only combining characters with > conflicting combining classes that have been forced "swapped" > will appear after CGJ, all other diacritics being encoded > preferably in the first sequence before the CGJ). > > This extra step should not be part of the NF forms (because > Unicode states that normalized forms will be kept normalized > in all further versions of Unicode), but this could be named > differently, by describing a system in which extra > normalization steps may be applied that may change NF forms > into other "equivalent" sequences also in normalized form.
Re: unicode on Linux
Jungshik Shin wrote: > the applications do not expect UTF-8, for instance > ls sorts alphabetically but does not know Unicode sorting). Does 'ls' sort filenames when they're in ISO-8859-1? My "ls", using the sv_SE.ISO-8859-1 locale, properly sorts file names alphabetically. Stefan
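The distinction behind Stefan's question is between raw byte/code-point order and locale collation (which `ls` gets from `strcoll()`). The contrast can be sketched in Python; the sample names are illustrative, and which locales are installed varies by system, hence the fallback:

```python
import locale

names = ["\u00d6rebro", "zebra", "apple"]   # "Örebro", "zebra", "apple"

# Raw code-point order, i.e. what a naive sort gives:
# "Ö" (U+00D6) sorts after every ASCII letter, so "Örebro" lands last.
assert sorted(names) == ["apple", "zebra", "\u00d6rebro"]

# Locale-aware order, comparable to what ls(1) gets via strcoll().
# English collation typically groups Ö with O, moving Örebro before zebra;
# the locale may not be installed, so fail soft.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(sorted(names, key=locale.strxfrm))
except locale.Error:
    pass
```

This is also why Stefan's sv_SE observation holds: a locale-honoring `ls` sorts ISO-8859-1 names correctly even though a byte sort would not.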
Re: unicode on Linux
Stephane Bortzmeyer wrote: > Kernel > 1) File names in Unicode: no (well, the Linux kernel is 8-bits clean > so you can always encode in UTF-8, but the kernel does not do any > normalization As others have written, I don't think the kernel has any business with normalization (although on Mac OS X, apparently the kernel does). > the applications do not expect UTF-8, for instance > ls sorts alphabetically but does not know Unicode sorting). Does 'ls' sort filenames when they're in ISO-8859-1? > 2) User names: worse since utilities to create an account refuse > UTF-8. Yeah, this should be fixed. > Applications > > 3) grep: no Unicode regexp I agree that grep and many other text utilities need to be updated to honor the locale (LC_COLLATE, LC_CTYPE and others). With glibc 2.2.x or later and gnulib, it shouldn't be as hard as before. In addition, you always have perl and python to turn to (both support Unicode very well). Also note that I wrote about 'honoring the locale' instead of supporting UTF-8, by which I want to emphasize that it's not just UTF-8 but also legacy character encodings that are not supported by grep and other GNU textutils used on Linux. > 4) xterm (or similar virtual terminals): No BiDi support at all mlterm does. It even supports Indic scripts. (xterm supports the Thai and Korean scripts, though.) Do you have any terminal emulator running on other platforms that does BiDi well? > 5) shells: I'm not aware of any line-editing shell (zsh, tcsh) > that have Unicode character semantics (back-character should move one > character, not one byte) A recent version of bash (to be precise, the GNU readline library it uses) has no problem with UTF-8 handling (although it does not do well with combining character sequences; that is, it doesn't have a notion of grapheme clusters). > 6) databases: I'm not aware of a free DBMS which has support for > Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).
Why is the OS to blame that there's no FREE DBMS that supports Unicode collation and regular expressions? Needless to say, there are commercial DBMSs that do both and run on Linux. > 7) Serious word processing: LaTeX has only very minimum Unicode Well, Linux distributions come not only with LaTeX/TeX but also with Lambda/Omega, their Unicode cousins. OpenType font support in Omega/Lambda is not there yet, but Indic scripts and other complex scripts (e.g. the Korean script) can be typeset with Omega/Lambda. Anyway, LaTeX/Lambda are not for word processing. If you want a word processor, you have to try OpenOffice/StarOffice, AbiWord, KWrite, and so forth, which support Unicode well. > Also, many applications (exmh, emacs) are ten times slower when > running in UTF-8 mode. Emacs' adoption of Unicode has been moving frustratingly slowly, and its performance may be slower in UTF-8 mode than otherwise (actually, there are a couple of different UTF-8 implementations for Emacs and I don't know which one you tried), but Vim is not. The reason Emacs is that much slower likely has to do with the fact that UTF-8 support was retrofitted to the ISO-2022-based infrastructure of MULE. Other applications on Linux do NOT have to carry that baggage, so they are not any slower in UTF-8 mode than in legacy encodings. Actually, they should be faster in UTF-8, because most modern toolkits/applications for Linux are based on Unicode, and in UTF-8 there's no (if UTF-8 is the internal representation, as in gtk) or little (if UTF-16 is used internally, as in Qt) overhead for the codeset conversion. Please, don't extrapolate from just a couple of bad examples. > At the present time, using Unicode on Unix is an act of faith. Well, I thought this is 2003. You wrote as if it's 2000. You sound like a one-time 'convert' who lost one's faith a long time ago and has never come back to see how much has changed since.
Moreover, given that in the above sentence you used 'Unix' instead of Linux, Sun and IBM engineers who worked on UTF-8 locale support on Solaris and AIX may take offense at your remark. I can't say much about AIX except that it has supported UTF-8 locales as long as Solaris has. As for Solaris, Solaris 7 and onward don't even have some of the remaining problems Linux still has (i.e. grep/sed/ls/sort and other textutils not honoring the locale in their handling of text streams). >> Default charset for recent versions of some popular distributions. > > > Yes, RedHat changed the default charset to Unicode without thinking > that text files were no longer readable. Unreadable? What is iconv(1) for? Perhaps RH should have included a nice GUI migration tool (as a part of the RH 8/9 installation disk) to let clueless end users (Mom and Pop) convert all their text files in legacy encodings to UTF-8, along with a similar tool for the filename conversion. I'm not saying that using Unicode (mostly in the form of UTF-8) on Linux is as seamless as I wish it to be (there are a number of issues I wan
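The migration step Jungshik points to with iconv(1) is a plain transcoding of file contents. What that amounts to can be sketched in Python (the file names `legacy.txt`/`utf8.txt` are illustrative; file *names*, as opposed to contents, need a separate pass, which is what dedicated rename tools handle):

```python
# Re-encode a legacy ISO-8859-1 text file as UTF-8 -- the same
# transformation as: iconv -f ISO-8859-1 -t UTF-8 legacy.txt > utf8.txt
with open("legacy.txt", "wb") as f:
    f.write(b"caf\xe9\n")                 # "café" in ISO-8859-1

with open("legacy.txt", encoding="iso-8859-1") as f:
    text = f.read()                       # decode legacy bytes to Unicode

with open("utf8.txt", "w", encoding="utf-8") as f:
    f.write(text)                         # re-encode as UTF-8

# é is now the two-byte UTF-8 sequence C3 A9
with open("utf8.txt", "rb") as f:
    assert f.read() == b"caf\xc3\xa9\n"
```

The catch, of course, is knowing the source encoding: ISO-8859-1 decodes any byte sequence without error, so a wrong guess corrupts silently rather than failing, which is why a migration tool would need per-file (or per-user) encoding information.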
Merging combining classes, was: New contribution N2676
On 25/10/2003 04:11, Philippe Verdy wrote: From: "Peter Kirk" <[EMAIL PROTECTED]> Have combining classes actually been defined for these characters? This is of course exactly the same problem as with Hebrew vowel points and accents, except that this time it applies to real living languages. Perhaps it is time to do something about these combining classes which conflict with the standard. Do you mean officially documenting the correct (and strict) use of CGJ as the only way to bypass the default order required by the combining classes in normalized forms? It would be a good idea to document officially which use of CGJ is superfluous and should be avoided in NF forms, and which use is required. This isn't what I meant, but I agree that some such definition would be a good idea. What I had in mind was a probably hopeless plea for the wrongly assigned combining classes to be corrected. After all, the current assignments manifestly breach the standard, because marks with different classes interact typographically. I wonder if it would in fact be possible to merge certain adjacent combining classes, as from a future numbered version N of the standard. That would not affect the normalisation of existing text; text normalised before version N would remain normalised in version N and later, although not vice versa. I know that this would break the letter of the current stability policy, but is this kind of backward compatibility actually necessary? The change could be sold to others as required for the internal consistency of Unicode. If this were possible, the Hebrew and Arabic problem could be partly solved, in a non-optimal way but one which is less messy than the current situation. The idea would be for all Hebrew marks (i.e. all combining marks in 05B0-05C2) to be merged into one combining class, and similarly all Arabic harakat etc. including the new Arabic tone signs. 
This would make significant the relative orderings of multiple vowels (and meteg), and avoid the need for CGJ hacks. It would also allow the logical order of shadda, dagesh and sin and shin dots to be the canonical one, with significant advantages for collation etc as well as for rendering. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
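The effect of merging classes can be modelled by running the canonical reordering step of normalization with a substituted combining-class table. A minimal Python sketch: `canonical_reorder` follows the standard Canonical Ordering Algorithm (a stable sort of nonzero-class runs, as in UAX #15); the merged table assigning one class to all Hebrew points is the hypothetical part.

```python
def canonical_reorder(chars, ccc):
    """Stable-sort each maximal run of nonzero-class characters by
    combining class; class-0 characters block reordering across them."""
    out = list(chars)
    for i in range(1, len(out)):
        j = i
        # swap only while the current mark has a nonzero class strictly
        # smaller than its predecessor's (the standard reordering rule)
        while j > 0 and 0 < ccc.get(out[j], 0) < ccc.get(out[j - 1], 0):
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return "".join(out)

QAMATS, METEG = "\u05b8", "\u05bd"
current = {QAMATS: 18, METEG: 22}   # today's classes: meteg gets reordered
merged  = {QAMATS: 18, METEG: 18}   # hypothetical single Hebrew class

# Today, logical <meteg, qamats> is forced into <qamats, meteg>:
assert canonical_reorder(METEG + QAMATS, current) == QAMATS + METEG
# Under merged classes the typed order is canonical, so it is preserved:
assert canonical_reorder(METEG + QAMATS, merged) == METEG + QAMATS
```

This illustrates both halves of Peter's claim: text normalized under the current (finer) classes is still normalized under the merged classes, since merging only removes reasons to swap, but not vice versa.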
Re: New contribution N2676
From: "Peter Kirk" <[EMAIL PROTECTED]> > Have combining classes actually been defined for these characters? > > This is of course exactly the same problem as with Hebrew vowel points > and accents, except that this time it applies to real living languages. > Perhaps it is time to do something about these combining classes which > conflict with the standard. Do you mean officially documenting the correct (and strict) use of CGJ as the only way to bypass the default order required by the combining classes in normalized forms? It would be a good idea to document officially which use of CGJ is superfluous and should be avoided in NF forms, and which use is required. 1) This will affect only the input methods for those languages that need to "swap" the standard order of combining characters to keep their logical order (all this will require is a additional input control that will try swapping ambiguous orders). 2) A complete documentation may need to specify which pairs of combining characters are affected (this should list the pairs of combining characters where CC(c1) > CC(c2) and that require to be encoded to be kept in logical order, as the sequence will be reordered into in normalized forms. 3) The other issue would be that there may exist other combining characters than those in this pair. Suppose I want to represent , where CC(c1) > CC(c2), but c3 does not have a conflicting pair in the previous list. Should it be encoded as or as ? As the standard normalization algorithm cannot be changed, both sequences will be possible with the NF forms, even though they represent the same character. One could design an extra normalization step to force one interpretation (so that only combining characters with conflicting combining classes that have been forced "swapped" will appear after CGJ, all other diacritics being encoded preferably in the first sequence before the CGJ). 
This extra step should not be part of the NF forms (because Unicode states that normalized forms will be kept normalized in all further versions of Unicode), but this could be named differently, by describing a system in which extra normalization steps may be applied that may change NF forms into other "equivalent" sequences also in normalized form.
Re: U+0BA3, U+0BA9
Hi Kenneth, All, Thank you for the quick clarification of matters. Kenneth Whistler <[EMAIL PROTECTED]> wrote: > U+0BA3 TAMIL LETTER NNA is the retroflex n, usually transliterated > as n-underdot. which is N in the UofKöln transliteration, I assume. > U+0BA9 TAMIL LETTER NNNA is the distinct alveolar n, usually > transliterated as n-macronbelow. which is n2 in the UofKöln transliteration, I assume. > The 10646 naming conventions, which are stuck with A-Z for > transliteration, generally use doubled letters to indicate > retroflex consonants, particularly for Indic languages. When > a third distinction needs to be made, as for Tamil, the > third name occasionally just gets a tripled letter, as is > the case for U+0BA9. So, in effect the UNICODE character names attempt to be a unified transliteration scheme for all languages? Are these principles laid down somewhere or is this more informal? > TSCII naming conventions may differ. I assume the TSCII authors got the UNICODE names mixed up, as Tamil was not short of differing transliteration schemes already before seeing the UNICODE one. Regards, Peter Jacobi
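The doubled/tripled-letter convention Kenneth describes is easy to inspect programmatically, since the (frozen) character names ship with the Unicode Character Database. For example, in Python:

```python
import unicodedata

# The three Tamil n's: the extra distinctions beyond plain NA are
# encoded by doubling and tripling the letter in the name.
assert unicodedata.name("\u0ba8") == "TAMIL LETTER NA"     # dental
assert unicodedata.name("\u0ba3") == "TAMIL LETTER NNA"    # retroflex
assert unicodedata.name("\u0ba9") == "TAMIL LETTER NNNA"   # alveolar
```

Because the names are frozen by stability policy, such lookups give the same answer in every conforming implementation, which is exactly why a naming mix-up elsewhere (as Peter suspects of TSCII) cannot be corrected in the standard itself, only aliased.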
Re: New contribution N2676
On 24/10/2003 18:09, Kenneth Whistler wrote: ... Incidentally, the characters U+065A..U+065C are all tonal diacritics for African languages written in the Arabic script. They should not be confused with the similarly shaped diacritics which are part of the extended letters of Arabic. The tones can be stacked on Arabic letters which already have letter diacritics as part of their shapes. Are they also potentially stacked with Arabic vowel signs (harakat)? If so, they interact with them typographically. And the standard specifies that they should therefore have the same combining classes as the harakat. The problem is, the harakat which appear in the same position have different combining classes. And if x<>y, there is no z such that z=x and z=y. So it is impossible to define these new characters in a way which does not conflict with the standard. Have combining classes actually been defined for these characters? This is of course exactly the same problem as with Hebrew vowel points and accents, except that this time it applies to real living languages. Perhaps it is time to do something about these combining classes which conflict with the standard. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
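The conflict Peter describes is visible directly in the shipped data: the harakat carry distinct fixed-position combining classes (27..35), so there is no single class z equal to all of them. Checking with Python's unicodedata:

```python
import unicodedata

# Arabic harakat have per-character fixed-position classes, not one
# shared class -- which is exactly Peter's "no z with z=x and z=y":
assert unicodedata.combining("\u064e") == 30   # ARABIC FATHA
assert unicodedata.combining("\u064f") == 31   # ARABIC DAMMA
assert unicodedata.combining("\u0650") == 32   # ARABIC KASRA
assert unicodedata.combining("\u0651") == 33   # ARABIC SHADDA
assert unicodedata.combining("\u0652") == 34   # ARABIC SUKUN
```

So a new above-base tone mark that stacks with the harakat cannot be given "the same combining class as the harakat": two harakat in the same position already disagree about what that class would be.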